[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=341438=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-341438 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 11/Nov/19 19:11 Start Date: 11/Nov/19 19:11 Worklog Time Spent: 10m Work Description: boyuanzz commented on pull request #9778: [BEAM-7013] Update BigQueryHllSketchCompatibilityIT to cover empty sketch cases URL: https://github.com/apache/beam/pull/9778 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 341438) Time Spent: 37h 20m (was: 37h 10m) > A new count distinct transform based on BigQuery compatible HyperLogLog++ > implementation > > > Key: BEAM-7013 > URL: https://issues.apache.org/jira/browse/BEAM-7013 > Project: Beam > Issue Type: New Feature > Components: extensions-java-sketching, sdk-java-core >Reporter: Yueyang Qiu >Assignee: Yueyang Qiu >Priority: Major > Fix For: 2.16.0 > > Time Spent: 37h 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=339648=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339648 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 07/Nov/19 00:29 Start Date: 07/Nov/19 00:29 Worklog Time Spent: 10m Work Description: robinyqiu commented on issue #9778: [BEAM-7013] Update BigQueryHllSketchCompatibilityIT to cover empty sketch cases URL: https://github.com/apache/beam/pull/9778#issuecomment-550564200 Run Java PostCommit Issue Time Tracking --- Worklog Id: (was: 339648) Time Spent: 37h 10m (was: 37h)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=339542=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339542 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 06/Nov/19 18:47 Start Date: 06/Nov/19 18:47 Worklog Time Spent: 10m Work Description: robinyqiu commented on issue #9778: [BEAM-7013] Update BigQueryHllSketchCompatibilityIT to cover empty sketch cases URL: https://github.com/apache/beam/pull/9778#issuecomment-550447746 Run Java_Examples_Dataflow PreCommit Issue Time Tracking --- Worklog Id: (was: 339542) Time Spent: 37h (was: 36h 50m)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=339541=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339541 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 06/Nov/19 18:47 Start Date: 06/Nov/19 18:47 Worklog Time Spent: 10m Work Description: robinyqiu commented on issue #9778: [BEAM-7013] Update BigQueryHllSketchCompatibilityIT to cover empty sketch cases URL: https://github.com/apache/beam/pull/9778#issuecomment-550447688 Run Java PreCommit Issue Time Tracking --- Worklog Id: (was: 339541) Time Spent: 36h 50m (was: 36h 40m)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=338301=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-338301 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 04/Nov/19 18:45 Start Date: 04/Nov/19 18:45 Worklog Time Spent: 10m Work Description: robinyqiu commented on pull request #9778: [BEAM-7013] Update BigQueryHllSketchCompatibilityIT to cover empty sketch cases URL: https://github.com/apache/beam/pull/9778#discussion_r342208431 ## File path: sdks/java/extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCount.java ## @@ -107,6 +109,20 @@ // Cannot be instantiated. This class is intended to be a namespace only. private HllCount() {} + /** + * Returns the sketch stored as bytes in the input {@code ByteBuffer}. If the input {@code Review comment: Done. Agreed, this does sound clearer! Issue Time Tracking --- Worklog Id: (was: 338301) Time Spent: 36h 40m (was: 36.5h)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=338241=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-338241 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 04/Nov/19 17:44 Start Date: 04/Nov/19 17:44 Worklog Time Spent: 10m Work Description: zfraa commented on pull request #9778: [BEAM-7013] Update BigQueryHllSketchCompatibilityIT to cover empty sketch cases URL: https://github.com/apache/beam/pull/9778#discussion_r342170633 ## File path: sdks/java/extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCount.java ## @@ -107,6 +109,20 @@ // Cannot be instantiated. This class is intended to be a namespace only. private HllCount() {} + /** + * Returns the sketch stored as bytes in the input {@code ByteBuffer}. If the input {@code Review comment: I was confused here at first: a reader who is not aware that this is a static method can read it as "the user passes in an empty/reusable ByteBuffer and the library returns the serialized sketch in that ByteBuffer". To keep not-so-alert readers from thinking that ;), how about "Converts the passed-in sketch from ByteBuffer to byte[], mapping null ByteBuffers (representing empty sketches) to empty byte[]. Utility method to convert sketches materialized with ZetaSQL/BigQuery to valid input for HllCount transforms."?
Issue Time Tracking --- Worklog Id: (was: 338241) Time Spent: 36.5h (was: 36h 20m)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=337612=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-337612 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 01/Nov/19 23:16 Start Date: 01/Nov/19 23:16 Worklog Time Spent: 10m Work Description: robinyqiu commented on issue #9778: [BEAM-7013] Update BigQueryHllSketchCompatibilityIT to cover empty sketch cases URL: https://github.com/apache/beam/pull/9778#issuecomment-548981025 > is there any way we could make the conversion functors (parseQueryResultToByteArray, and the inlined one for the other direction) available to users as utility objects or methods? I considered this, but unfortunately we cannot do that, because users might want to parse other fields in the same function. However, I did find a way to extract part of the logic into its own function; see the `getSketchFromByteBuffer` function. Issue Time Tracking --- Worklog Id: (was: 337612) Time Spent: 36h 20m (was: 36h 10m)
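The point made in the comment above can be illustrated with a minimal sketch. It assumes a plain `Map<String, Object>` as a stand-in for a BigQuery query-result record; the class `RowParsingSketch`, the type `ParsedRow`, the helper `sketchBytes`, and the field name `country` are all hypothetical and are not taken from the pull request. The idea shown is only that the row-level parse function stays user-written (because it may read other fields), while the null-to-empty sketch conversion is the part that can be factored out.

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

public final class RowParsingSketch {
  private RowParsingSketch() {}

  /** Hypothetical result type: the sketch plus one other field from the same row. */
  static final class ParsedRow {
    final String country;
    final byte[] sketch;

    ParsedRow(String country, byte[] sketch) {
      this.country = country;
      this.sketch = sketch;
    }
  }

  /** Hypothetical stand-in for the extracted helper: a null buffer becomes an empty byte[]. */
  static byte[] sketchBytes(ByteBuffer buffer) {
    if (buffer == null) {
      return new byte[0];
    }
    ByteBuffer view = buffer.duplicate(); // leave the caller's position untouched
    byte[] bytes = new byte[view.remaining()];
    view.get(bytes);
    return bytes;
  }

  /** User-written parse function: reads another field and delegates only the sketch part. */
  static ParsedRow parse(Map<String, Object> row) {
    String country = (String) row.get("country");
    byte[] sketch = sketchBytes((ByteBuffer) row.get("sketch"));
    return new ParsedRow(country, sketch);
  }

  public static void main(String[] args) {
    Map<String, Object> row = new HashMap<>();
    row.put("country", "CH");
    row.put("sketch", null); // an empty sketch arrives as null
    ParsedRow parsed = parse(row);
    System.out.println(parsed.country + " " + parsed.sketch.length); // prints "CH 0"
  }
}
```

A library could not ship `parse` itself, since its shape depends on which fields a given query returns; a small helper like `sketchBytes` is reusable regardless.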
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=337122=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-337122 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 01/Nov/19 00:45 Start Date: 01/Nov/19 00:45 Worklog Time Spent: 10m Work Description: robinyqiu commented on pull request #9778: [BEAM-7013] Update BigQueryHllSketchCompatibilityIT to cover empty sketch cases URL: https://github.com/apache/beam/pull/9778#discussion_r341417808 ## File path: sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/BigQueryHllSketchCompatibilityIT.java ## @@ -65,23 +66,32 @@ private static final List TEST_DATA = Arrays.asList("Apple", "Orange", "Banana", "Orange"); - // Data Table: used by testReadSketchFromBigQuery()) + // Data Table: used by tests reading sketches from BigQuery // Schema: only one STRING field named "data". - // Content: prepopulated with 4 rows: "Apple", "Orange", "Banana", "Orange" - private static final String DATA_TABLE_ID = "hll_data"; private static final String DATA_FIELD_NAME = "data"; private static final String DATA_FIELD_TYPE = "STRING"; private static final String QUERY_RESULT_FIELD_NAME = "sketch"; - private static final Long EXPECTED_COUNT = 3L; - // Sketch Table: used by testWriteSketchToBigQuery() + // Content: prepopulated with 4 rows: "Apple", "Orange", "Banana", "Orange" + private static final String DATA_TABLE_ID_NON_EMPTY = "hll_data_non_empty"; + private static final Long EXPECTED_COUNT_NON_EMPTY = 3L; + + // Content: empty + private static final String DATA_TABLE_ID_EMPTY = "hll_data_empty"; Review comment: Yes it does and I have tried it (although it is not mentioned in the BigQuery documentation). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
Issue Time Tracking --- Worklog Id: (was: 337122) Time Spent: 36h 10m (was: 36h)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=337118=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-337118 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 01/Nov/19 00:29 Start Date: 01/Nov/19 00:29 Worklog Time Spent: 10m Work Description: robinyqiu commented on pull request #9778: [BEAM-7013] Update BigQueryHllSketchCompatibilityIT to cover empty sketch cases URL: https://github.com/apache/beam/pull/9778#discussion_r341415540 ## File path: sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/BigQueryHllSketchCompatibilityIT.java ## @@ -126,22 +145,49 @@ public static void deleteDataset() throws Exception { } /** - * Test that HLL++ sketch computed in BigQuery can be processed by Beam. Hll sketch is computed by - * {@code HLL_COUNT.INIT} in BigQuery and read into Beam; the test verifies that we can run {@link - * HllCount.MergePartial} and {@link HllCount.Extract} on the sketch in Beam to get the correct - * estimated count. + * Test that non-empty HLL++ sketch computed in BigQuery can be processed by Beam. + * + * Hll sketch is computed by {@code HLL_COUNT.INIT} in BigQuery and read into Beam; the test Review comment: Fixed! Thanks for the link, that's a good read. :) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
Issue Time Tracking --- Worklog Id: (was: 337118) Time Spent: 36h (was: 35h 50m)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=337110=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-337110 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 01/Nov/19 00:23 Start Date: 01/Nov/19 00:23 Worklog Time Spent: 10m Work Description: robinyqiu commented on pull request #9778: [BEAM-7013] Update BigQueryHllSketchCompatibilityIT to cover empty sketch cases URL: https://github.com/apache/beam/pull/9778#discussion_r341414587 ## File path: sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/BigQueryHllSketchCompatibilityIT.java ## @@ -126,22 +145,49 @@ public static void deleteDataset() throws Exception { } /** - * Test that HLL++ sketch computed in BigQuery can be processed by Beam. Hll sketch is computed by - * {@code HLL_COUNT.INIT} in BigQuery and read into Beam; the test verifies that we can run {@link - * HllCount.MergePartial} and {@link HllCount.Extract} on the sketch in Beam to get the correct - * estimated count. + * Test that non-empty HLL++ sketch computed in BigQuery can be processed by Beam. + * + * Hll sketch is computed by {@code HLL_COUNT.INIT} in BigQuery and read into Beam; the test + * verifies that we can run {@link HllCount.MergePartial} and {@link HllCount.Extract} on the + * sketch in Beam to get the correct estimated count. + */ + @Test + public void testReadNonEmptySketchFromBigQuery() { +readSketchFromBigQuery(DATA_TABLE_ID_NON_EMPTY, EXPECTED_COUNT_NON_EMPTY); + } + + /** + * Test that empty HLL++ sketch computed in BigQuery can be processed by Beam. + * + * Hll sketch is computed by {@code HLL_COUNT.INIT} in BigQuery and read into Beam; the test + * verifies that we can run {@link HllCount.MergePartial} and {@link HllCount.Extract} on the + * sketch in Beam to get the correct estimated count. 
*/ @Test - public void testReadSketchFromBigQuery() { -String tableSpec = String.format("%s.%s", DATASET_ID, DATA_TABLE_ID); + public void testReadEmptySketchFromBigQuery() { +readSketchFromBigQuery(DATA_TABLE_ID_EMPTY, EXPECTED_COUNT_EMPTY); + } + + private void readSketchFromBigQuery(String tableId, Long expectedCount) { +String tableSpec = String.format("%s.%s", DATASET_ID, tableId); String query = String.format( "SELECT HLL_COUNT.INIT(%s) AS %s FROM %s", DATA_FIELD_NAME, QUERY_RESULT_FIELD_NAME, tableSpec); + SerializableFunction parseQueryResultToByteArray = -(SchemaAndRecord schemaAndRecord) -> -// BigQuery BYTES type corresponds to Java java.nio.ByteBuffer type -((ByteBuffer) schemaAndRecord.getRecord().get(QUERY_RESULT_FIELD_NAME)).array(); +input -> { + // BigQuery BYTES type corresponds to Java java.nio.ByteBuffer type + ByteBuffer sketch = (ByteBuffer) input.getRecord().get(QUERY_RESULT_FIELD_NAME); + if (sketch == null) { +// Empty sketch is represented by null in BigQuery and by empty byte array in Beam +return new byte[0]; + } else { +byte[] result = new byte[sketch.remaining()]; Review comment: Exactly. We know that is the case by looking into `Avro`'s implementation, but compiler does not know that, and it gives a warning if we use `.array()`. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
Issue Time Tracking --- Worklog Id: (was: 337110) Time Spent: 35h 50m (was: 35h 40m)
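To make the `.array()` discussion above concrete, here is a small standalone sketch; it is not the code from the pull request. The class name `SketchBytes` and method name `toByteArray` are hypothetical. It relies only on documented `java.nio.ByteBuffer` behavior: `array()` returns the entire backing array regardless of position and limit (and can throw for read-only or direct buffers), whereas copying `remaining()` bytes yields exactly the readable content.

```java
import java.nio.ByteBuffer;

public final class SketchBytes {
  private SketchBytes() {}

  /**
   * Hypothetical helper: copies the readable bytes of the buffer into a new array.
   * A null buffer (how an empty sketch arrives from BigQuery, per the thread) maps to
   * an empty byte array.
   */
  public static byte[] toByteArray(ByteBuffer sketch) {
    if (sketch == null) {
      return new byte[0];
    }
    // Work on a duplicate so the caller's position and limit are not modified.
    ByteBuffer view = sketch.duplicate();
    byte[] result = new byte[view.remaining()];
    view.get(result);
    return result;
  }

  public static void main(String[] args) {
    // A buffer that exposes only bytes {2, 3} of a four-byte backing array.
    ByteBuffer window = ByteBuffer.wrap(new byte[] {1, 2, 3, 4}, 1, 2);

    System.out.println(toByteArray(window).length); // prints 2
    System.out.println(window.array().length);      // prints 4: the whole backing array
    System.out.println(toByteArray(null).length);   // prints 0
  }
}
```

When the buffer happens to span its whole backing array, as the comment says is the case for buffers produced by Avro, both approaches return the same bytes; the copy simply does not depend on that implementation detail.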
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=337108=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-337108 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 01/Nov/19 00:21 Start Date: 01/Nov/19 00:21 Worklog Time Spent: 10m Work Description: robinyqiu commented on pull request #9778: [BEAM-7013] Update BigQueryHllSketchCompatibilityIT to cover empty sketch cases URL: https://github.com/apache/beam/pull/9778#discussion_r341414290 ## File path: sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/BigQueryHllSketchCompatibilityIT.java ## @@ -96,28 +106,37 @@ public static void prepareDatasetAndDataTable() throws Exception { Review comment: Ah, nice catch! Issue Time Tracking --- Worklog Id: (was: 337108) Time Spent: 35h 40m (was: 35.5h)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=330565=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330565 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 18/Oct/19 15:20 Start Date: 18/Oct/19 15:20 Worklog Time Spent: 10m Work Description: zfraa commented on pull request #9778: [BEAM-7013] Update BigQueryHllSketchCompatibilityIT to cover empty sketch cases URL: https://github.com/apache/beam/pull/9778#discussion_r336525227 ## File path: sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/BigQueryHllSketchCompatibilityIT.java ## @@ -126,22 +145,49 @@ public static void deleteDataset() throws Exception { } /** - * Test that HLL++ sketch computed in BigQuery can be processed by Beam. Hll sketch is computed by - * {@code HLL_COUNT.INIT} in BigQuery and read into Beam; the test verifies that we can run {@link - * HllCount.MergePartial} and {@link HllCount.Extract} on the sketch in Beam to get the correct - * estimated count. + * Test that non-empty HLL++ sketch computed in BigQuery can be processed by Beam. + * + * Hll sketch is computed by {@code HLL_COUNT.INIT} in BigQuery and read into Beam; the test + * verifies that we can run {@link HllCount.MergePartial} and {@link HllCount.Extract} on the + * sketch in Beam to get the correct estimated count. + */ + @Test + public void testReadNonEmptySketchFromBigQuery() { +readSketchFromBigQuery(DATA_TABLE_ID_NON_EMPTY, EXPECTED_COUNT_NON_EMPTY); + } + + /** + * Test that empty HLL++ sketch computed in BigQuery can be processed by Beam. Review comment: (same here: "Test that an empty...", "The HLL sketch...") This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
Issue Time Tracking --- Worklog Id: (was: 330565) Time Spent: 35h 20m (was: 35h 10m)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=330563=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330563 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 18/Oct/19 15:20 Start Date: 18/Oct/19 15:20 Worklog Time Spent: 10m Work Description: zfraa commented on pull request #9778: [BEAM-7013] Update BigQueryHllSketchCompatibilityIT to cover empty sketch cases URL: https://github.com/apache/beam/pull/9778#discussion_r336535846 ## File path: sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/BigQueryHllSketchCompatibilityIT.java ## @@ -96,28 +106,37 @@ public static void prepareDatasetAndDataTable() throws Exception { BIGQUERY_CLIENT.createNewDataset(PROJECT_ID, DATASET_ID); -// Create Data Table TableSchema dataTableSchema = new TableSchema() .setFields( Collections.singletonList( new TableFieldSchema().setName(DATA_FIELD_NAME).setType(DATA_FIELD_TYPE))); -Table dataTable = + +Table dataTableNonEmpty = new Table() .setSchema(dataTableSchema) .setTableReference( new TableReference() .setProjectId(PROJECT_ID) .setDatasetId(DATASET_ID) -.setTableId(DATA_TABLE_ID)); -BIGQUERY_CLIENT.createNewTable(PROJECT_ID, DATASET_ID, dataTable); - +.setTableId(DATA_TABLE_ID_NON_EMPTY)); +BIGQUERY_CLIENT.createNewTable(PROJECT_ID, DATASET_ID, dataTableNonEmpty); // Prepopulate test data to Data Table Review comment: "Prepopulate data tables with test data" This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
Issue Time Tracking --- Worklog Id: (was: 330563) Time Spent: 35h 10m (was: 35h)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=330564=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330564 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 18/Oct/19 15:20 Start Date: 18/Oct/19 15:20 Worklog Time Spent: 10m Work Description: zfraa commented on pull request #9778: [BEAM-7013] Update BigQueryHllSketchCompatibilityIT to cover empty sketch cases URL: https://github.com/apache/beam/pull/9778#discussion_r336534833 ## File path: sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/BigQueryHllSketchCompatibilityIT.java ## @@ -96,28 +106,37 @@ public static void prepareDatasetAndDataTable() throws Exception { Review comment: Maybe rename to "...AndDataTables()"? Issue Time Tracking --- Worklog Id: (was: 330564) Time Spent: 35h 20m (was: 35h 10m)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=330566=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330566 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 18/Oct/19 15:20 Start Date: 18/Oct/19 15:20 Worklog Time Spent: 10m Work Description: zfraa commented on pull request #9778: [BEAM-7013] Update BigQueryHllSketchCompatibilityIT to cover empty sketch cases URL: https://github.com/apache/beam/pull/9778#discussion_r336524398 ## File path: sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/BigQueryHllSketchCompatibilityIT.java ## @@ -126,22 +145,49 @@ public static void deleteDataset() throws Exception { } /** - * Test that HLL++ sketch computed in BigQuery can be processed by Beam. Hll sketch is computed by - * {@code HLL_COUNT.INIT} in BigQuery and read into Beam; the test verifies that we can run {@link - * HllCount.MergePartial} and {@link HllCount.Extract} on the sketch in Beam to get the correct - * estimated count. + * Test that non-empty HLL++ sketch computed in BigQuery can be processed by Beam. + * + * Hll sketch is computed by {@code HLL_COUNT.INIT} in BigQuery and read into Beam; the test Review comment: Nit: "Test that a HLL++ sketch...", and "The HLL sketch is computed...". Otherwise, LGTM! Also, all Javadoc should be in third person ("Tests that..." instead of "Test that"; see https://www.oracle.com/technetwork/articles/java/index-137868.html, "Use 3rd person..."). Sorry that I missed this in the first version of this code! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
Issue Time Tracking --- Worklog Id: (was: 330566) Time Spent: 35.5h (was: 35h 20m)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=330561=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330561 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 18/Oct/19 15:20 Start Date: 18/Oct/19 15:20 Worklog Time Spent: 10m Work Description: zfraa commented on pull request #9778: [BEAM-7013] Update BigQueryHllSketchCompatibilityIT to cover empty sketch cases URL: https://github.com/apache/beam/pull/9778#discussion_r336542489 ## File path: sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/BigQueryHllSketchCompatibilityIT.java ## @@ -65,23 +66,32 @@ private static final List TEST_DATA = Arrays.asList("Apple", "Orange", "Banana", "Orange"); - // Data Table: used by testReadSketchFromBigQuery()) + // Data Table: used by tests reading sketches from BigQuery // Schema: only one STRING field named "data". - // Content: prepopulated with 4 rows: "Apple", "Orange", "Banana", "Orange" - private static final String DATA_TABLE_ID = "hll_data"; private static final String DATA_FIELD_NAME = "data"; private static final String DATA_FIELD_TYPE = "STRING"; private static final String QUERY_RESULT_FIELD_NAME = "sketch"; - private static final Long EXPECTED_COUNT = 3L; - // Sketch Table: used by testWriteSketchToBigQuery() + // Content: prepopulated with 4 rows: "Apple", "Orange", "Banana", "Orange" + private static final String DATA_TABLE_ID_NON_EMPTY = "hll_data_non_empty"; + private static final Long EXPECTED_COUNT_NON_EMPTY = 3L; + + // Content: empty + private static final String DATA_TABLE_ID_EMPTY = "hll_data_empty"; Review comment: Does the aggregation (HLL_COUNT.INIT) over an empty table return a NULL sketch, as expected? I.e., did the test fail before you modified 'parseQueryResultToByteArray' to deal with NULLs? I think it should (according to https://plx.corp.google.com/scripts2/script_5d._6fe78c__2144_8a38_883d24fc4a60, last SELECT), but double-checking. 
I wish we could add an assert, but that doesn't work with the utility method. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 330561) Time Spent: 34h 50m (was: 34h 40m) > A new count distinct transform based on BigQuery compatible HyperLogLog++ > implementation > > > Key: BEAM-7013 > URL: https://issues.apache.org/jira/browse/BEAM-7013 > Project: Beam > Issue Type: New Feature > Components: extensions-java-sketching, sdk-java-core >Reporter: Yueyang Qiu >Assignee: Yueyang Qiu >Priority: Major > Fix For: 2.16.0 > > Time Spent: 34h 50m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=330562=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330562 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 18/Oct/19 15:20 Start Date: 18/Oct/19 15:20 Worklog Time Spent: 10m Work Description: zfraa commented on pull request #9778: [BEAM-7013] Update BigQueryHllSketchCompatibilityIT to cover empty sketch cases URL: https://github.com/apache/beam/pull/9778#discussion_r336529565 ## File path: sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/BigQueryHllSketchCompatibilityIT.java ## @@ -126,22 +145,49 @@ public static void deleteDataset() throws Exception { } /** - * Test that HLL++ sketch computed in BigQuery can be processed by Beam. Hll sketch is computed by - * {@code HLL_COUNT.INIT} in BigQuery and read into Beam; the test verifies that we can run {@link - * HllCount.MergePartial} and {@link HllCount.Extract} on the sketch in Beam to get the correct - * estimated count. + * Test that non-empty HLL++ sketch computed in BigQuery can be processed by Beam. + * + * Hll sketch is computed by {@code HLL_COUNT.INIT} in BigQuery and read into Beam; the test + * verifies that we can run {@link HllCount.MergePartial} and {@link HllCount.Extract} on the + * sketch in Beam to get the correct estimated count. + */ + @Test + public void testReadNonEmptySketchFromBigQuery() { +readSketchFromBigQuery(DATA_TABLE_ID_NON_EMPTY, EXPECTED_COUNT_NON_EMPTY); + } + + /** + * Test that empty HLL++ sketch computed in BigQuery can be processed by Beam. + * + * Hll sketch is computed by {@code HLL_COUNT.INIT} in BigQuery and read into Beam; the test + * verifies that we can run {@link HllCount.MergePartial} and {@link HllCount.Extract} on the + * sketch in Beam to get the correct estimated count. 
*/ @Test - public void testReadSketchFromBigQuery() { -String tableSpec = String.format("%s.%s", DATASET_ID, DATA_TABLE_ID); + public void testReadEmptySketchFromBigQuery() { +readSketchFromBigQuery(DATA_TABLE_ID_EMPTY, EXPECTED_COUNT_EMPTY); + } + + private void readSketchFromBigQuery(String tableId, Long expectedCount) { +String tableSpec = String.format("%s.%s", DATASET_ID, tableId); String query = String.format( "SELECT HLL_COUNT.INIT(%s) AS %s FROM %s", DATA_FIELD_NAME, QUERY_RESULT_FIELD_NAME, tableSpec); + SerializableFunction parseQueryResultToByteArray = -(SchemaAndRecord schemaAndRecord) -> -// BigQuery BYTES type corresponds to Java java.nio.ByteBuffer type -((ByteBuffer) schemaAndRecord.getRecord().get(QUERY_RESULT_FIELD_NAME)).array(); +input -> { + // BigQuery BYTES type corresponds to Java java.nio.ByteBuffer type + ByteBuffer sketch = (ByteBuffer) input.getRecord().get(QUERY_RESULT_FIELD_NAME); + if (sketch == null) { +// Empty sketch is represented by null in BigQuery and by empty byte array in Beam +return new byte[0]; + } else { +byte[] result = new byte[sketch.remaining()]; Review comment: Why not `return sketch.array()` as before? Is it because we can't be 100% sure that the ByteBuffer is backed by an accessible array? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 330562) Time Spent: 35h (was: 34h 50m) > A new count distinct transform based on BigQuery compatible HyperLogLog++ > implementation > > > Key: BEAM-7013 > URL: https://issues.apache.org/jira/browse/BEAM-7013 > Project: Beam > Issue Type: New Feature > Components: extensions-java-sketching, sdk-java-core >Reporter: Yueyang Qiu >Assignee: Yueyang Qiu >Priority: Major > Fix For: 2.16.0 > > Time Spent: 35h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
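The question in the review comment above (copying via `remaining()` versus calling `array()`) can be illustrated with a small, self-contained sketch. It assumes nothing about what BigQueryIO actually hands back; the class name `SketchBytes` and the helper `toByteArray` are hypothetical and exist only for this illustration. The point is that `ByteBuffer.array()` is only usable when `hasArray()` is true, and even then it returns the whole backing array rather than just the readable window, whereas a copy driven by `remaining()` works for any buffer.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical helper, not part of Beam: converts a nullable ByteBuffer to byte[].
final class SketchBytes {

  /** Copies the readable bytes of the buffer; a null buffer maps to an empty array. */
  public static byte[] toByteArray(ByteBuffer sketch) {
    if (sketch == null) {
      // Mirrors the convention discussed in the PR: null in BigQuery, byte[0] in Beam.
      return new byte[0];
    }
    // Work on a duplicate so the caller's position and limit are left untouched.
    ByteBuffer view = sketch.duplicate();
    byte[] result = new byte[view.remaining()];
    view.get(result);
    return result;
  }

  public static void main(String[] args) {
    // A read-only buffer has no accessible backing array, so array() would throw.
    ByteBuffer readOnly = ByteBuffer.wrap(new byte[] {1, 2, 3}).asReadOnlyBuffer();
    System.out.println(readOnly.hasArray());                    // false
    System.out.println(Arrays.toString(toByteArray(readOnly))); // [1, 2, 3]

    // A buffer that covers only part of its backing array: array() returns all 4 bytes.
    ByteBuffer window = ByteBuffer.wrap(new byte[] {9, 1, 2, 3}, 1, 3);
    System.out.println(window.array().length);                  // 4
    System.out.println(Arrays.toString(toByteArray(window)));   // [1, 2, 3]

    System.out.println(toByteArray(null).length);               // 0
  }
}
```

Whether the buffers seen in the integration test are ever read-only or offset is not stated in the thread; the copy is simply the variant that does not depend on that.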
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=327185=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-327185 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 12/Oct/19 02:27 Start Date: 12/Oct/19 02:27 Worklog Time Spent: 10m Work Description: robinyqiu commented on issue #9778: [BEAM-7013] Update BigQueryHllSketchCompatibilityIT to cover empty sketch cases URL: https://github.com/apache/beam/pull/9778#issuecomment-541275070 See all the 4 test cases pass here: https://builds.apache.org/job/beam_PostCommit_Java_PR/237/testReport/org.apache.beam.sdk.extensions.zetasketch/BigQueryHllSketchCompatibilityIT/ This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 327185) Time Spent: 34h 40m (was: 34.5h) > A new count distinct transform based on BigQuery compatible HyperLogLog++ > implementation > > > Key: BEAM-7013 > URL: https://issues.apache.org/jira/browse/BEAM-7013 > Project: Beam > Issue Type: New Feature > Components: extensions-java-sketching, sdk-java-core >Reporter: Yueyang Qiu >Assignee: Yueyang Qiu >Priority: Major > Fix For: 2.16.0 > > Time Spent: 34h 40m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=327140=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-327140 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 11/Oct/19 23:43 Start Date: 11/Oct/19 23:43 Worklog Time Spent: 10m Work Description: robinyqiu commented on issue #9778: [BEAM-7013] Update BigQueryHllSketchCompatibilityIT to cover empty sketch cases URL: https://github.com/apache/beam/pull/9778#issuecomment-541257592 Run Java PostCommit This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 327140) Time Spent: 34.5h (was: 34h 20m) > A new count distinct transform based on BigQuery compatible HyperLogLog++ > implementation > > > Key: BEAM-7013 > URL: https://issues.apache.org/jira/browse/BEAM-7013 > Project: Beam > Issue Type: New Feature > Components: extensions-java-sketching, sdk-java-core >Reporter: Yueyang Qiu >Assignee: Yueyang Qiu >Priority: Major > Fix For: 2.16.0 > > Time Spent: 34.5h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=327139=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-327139 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 11/Oct/19 23:41 Start Date: 11/Oct/19 23:41 Worklog Time Spent: 10m Work Description: robinyqiu commented on pull request #9778: [BEAM-7013] Update BigQueryHllSketchCompatibilityIT to cover empty sketch cases URL: https://github.com/apache/beam/pull/9778 r: @zfraa
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=310910=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-310910 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 11/Sep/19 18:23 Start Date: 11/Sep/19 18:23 Worklog Time Spent: 10m Work Description: boyuanzz commented on pull request #9519: [BEAM-7013] Use a 0-length byte array to represent empty sketch in HllCount URL: https://github.com/apache/beam/pull/9519 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 310910) Time Spent: 34h 10m (was: 34h) > A new count distinct transform based on BigQuery compatible HyperLogLog++ > implementation > > > Key: BEAM-7013 > URL: https://issues.apache.org/jira/browse/BEAM-7013 > Project: Beam > Issue Type: New Feature > Components: extensions-java-sketching, sdk-java-core >Reporter: Yueyang Qiu >Assignee: Yueyang Qiu >Priority: Major > Fix For: 2.16.0 > > Time Spent: 34h 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=310111=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-310111 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 10/Sep/19 20:53 Start Date: 10/Sep/19 20:53 Worklog Time Spent: 10m Work Description: zfraa commented on pull request #9519: [BEAM-7013] Use a 0-length byte array to represent empty sketch in HllCount URL: https://github.com/apache/beam/pull/9519#discussion_r322957148 ## File path: sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/HllCountTest.java ## @@ -302,12 +304,53 @@ public void testMergePartialGlobally_SingletonInput() { @Test @Category(NeedsRunner.class) - public void testMergePartialGlobally_EmptyInput() { + public void testMergePartialGlobally_SingletonInputEmptySketch() { PCollection result = -p.apply(Create.empty(TypeDescriptor.of(byte[].class))) + p.apply(Create.of(EMPTY_SKETCH)).apply(HllCount.MergePartial.globally()); + +PAssert.thatSingleton(result).isEqualTo(EMPTY_SKETCH); +p.run(); + } + + @Test + @Category(NeedsRunner.class) + public void testMergePartialGlobally_MergeWithEmptySketch() { +PCollection result = +p.apply(Create.of(LONGS_SKETCH, EMPTY_SKETCH)).apply(HllCount.MergePartial.globally()); + +PAssert.thatSingleton(result).isEqualTo(LONGS_SKETCH); +p.run(); + } + + @Test + @Category(NeedsRunner.class) + public void testMergePartialGlobally_MergeMultipleEmptySketches() { +PCollection result = +p.apply(Create.of(EMPTY_SKETCH, EMPTY_SKETCH)).apply(HllCount.MergePartial.globally()); + +PAssert.thatSingleton(result).isEqualTo(EMPTY_SKETCH); +p.run(); + } + + @Test + @Category(NeedsRunner.class) + public void testMergePartialGlobally_MergeWithSketchOfEmptySet() { +PCollection result = +p.apply(Create.of(LONGS_SKETCH, LONGS_SKETCH_OF_EMPTY_SET)) +.apply(HllCount.MergePartial.globally()); + +PAssert.thatSingleton(result).isEqualTo(LONGS_SKETCH); +p.run(); + } + + @Test + 
@Category(NeedsRunner.class) + public void testMergePartialGlobally_MergeEmptySketchWithSketchOfEmptySet() { +PCollection result = +p.apply(Create.of(EMPTY_SKETCH, LONGS_SKETCH_OF_EMPTY_SET)) .apply(HllCount.MergePartial.globally()); -PAssert.that(result).empty(); +PAssert.thatSingleton(result).isEqualTo(LONGS_SKETCH_OF_EMPTY_SET); Review comment: Good point, SG. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 310111) Time Spent: 34h (was: 33h 50m) > A new count distinct transform based on BigQuery compatible HyperLogLog++ > implementation > > > Key: BEAM-7013 > URL: https://issues.apache.org/jira/browse/BEAM-7013 > Project: Beam > Issue Type: New Feature > Components: extensions-java-sketching, sdk-java-core >Reporter: Yueyang Qiu >Assignee: Yueyang Qiu >Priority: Major > Fix For: 2.16.0 > > Time Spent: 34h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=310088=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-310088 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 10/Sep/19 20:34 Start Date: 10/Sep/19 20:34 Worklog Time Spent: 10m Work Description: robinyqiu commented on pull request #9519: [BEAM-7013] Use a 0-length byte array to represent empty sketch in HllCount URL: https://github.com/apache/beam/pull/9519#discussion_r322949267 ## File path: sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/HllCountTest.java ## @@ -302,12 +304,53 @@ public void testMergePartialGlobally_SingletonInput() { @Test @Category(NeedsRunner.class) - public void testMergePartialGlobally_EmptyInput() { + public void testMergePartialGlobally_SingletonInputEmptySketch() { PCollection result = -p.apply(Create.empty(TypeDescriptor.of(byte[].class))) + p.apply(Create.of(EMPTY_SKETCH)).apply(HllCount.MergePartial.globally()); + +PAssert.thatSingleton(result).isEqualTo(EMPTY_SKETCH); +p.run(); + } + + @Test + @Category(NeedsRunner.class) + public void testMergePartialGlobally_MergeWithEmptySketch() { +PCollection result = +p.apply(Create.of(LONGS_SKETCH, EMPTY_SKETCH)).apply(HllCount.MergePartial.globally()); + +PAssert.thatSingleton(result).isEqualTo(LONGS_SKETCH); +p.run(); + } + + @Test + @Category(NeedsRunner.class) + public void testMergePartialGlobally_MergeMultipleEmptySketches() { +PCollection result = +p.apply(Create.of(EMPTY_SKETCH, EMPTY_SKETCH)).apply(HllCount.MergePartial.globally()); + +PAssert.thatSingleton(result).isEqualTo(EMPTY_SKETCH); +p.run(); + } + + @Test + @Category(NeedsRunner.class) + public void testMergePartialGlobally_MergeWithSketchOfEmptySet() { +PCollection result = +p.apply(Create.of(LONGS_SKETCH, LONGS_SKETCH_OF_EMPTY_SET)) +.apply(HllCount.MergePartial.globally()); + +PAssert.thatSingleton(result).isEqualTo(LONGS_SKETCH); +p.run(); + } + + @Test + 
@Category(NeedsRunner.class) + public void testMergePartialGlobally_MergeEmptySketchWithSketchOfEmptySet() { +PCollection result = +p.apply(Create.of(EMPTY_SKETCH, LONGS_SKETCH_OF_EMPTY_SET)) .apply(HllCount.MergePartial.globally()); -PAssert.that(result).empty(); +PAssert.thatSingleton(result).isEqualTo(LONGS_SKETCH_OF_EMPTY_SET); Review comment: > this is testing an implementation detail (there are other valid return values here). Yes, but the result is exposed to the user anyway, so I prefer keeping it as is. If we decide to return a different value (byte[0]) in the future, we will change this unit test as well. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 310088) Time Spent: 33h 50m (was: 33h 40m) > A new count distinct transform based on BigQuery compatible HyperLogLog++ > implementation > > > Key: BEAM-7013 > URL: https://issues.apache.org/jira/browse/BEAM-7013 > Project: Beam > Issue Type: New Feature > Components: extensions-java-sketching, sdk-java-core >Reporter: Yueyang Qiu >Assignee: Yueyang Qiu >Priority: Major > Fix For: 2.16.0 > > Time Spent: 33h 50m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=310085=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-310085 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 10/Sep/19 20:30 Start Date: 10/Sep/19 20:30 Worklog Time Spent: 10m Work Description: zfraa commented on issue #9519: [BEAM-7013] Use a 0-length byte array to represent empty sketch in HllCount URL: https://github.com/apache/beam/pull/9519#issuecomment-530106641 Still LGTM! On the test naming with underscores: Yep, zetasketch does this -- we follow the principle outlined e.g. here: https://osherove.com/blog/2005/4/3/naming-standards-for-unit-tests.html, which is also consistent with internal Java Style. But let's follow Beam style here! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 310085) Time Spent: 33h 40m (was: 33.5h) > A new count distinct transform based on BigQuery compatible HyperLogLog++ > implementation > > > Key: BEAM-7013 > URL: https://issues.apache.org/jira/browse/BEAM-7013 > Project: Beam > Issue Type: New Feature > Components: extensions-java-sketching, sdk-java-core >Reporter: Yueyang Qiu >Assignee: Yueyang Qiu >Priority: Major > Fix For: 2.16.0 > > Time Spent: 33h 40m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=310084=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-310084 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 10/Sep/19 20:29 Start Date: 10/Sep/19 20:29 Worklog Time Spent: 10m Work Description: robinyqiu commented on pull request #9519: [BEAM-7013] Use a 0-length byte array to represent empty sketch in HllCount URL: https://github.com/apache/beam/pull/9519#discussion_r322947030 ## File path: sdks/java/extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCountMergePartialFn.java ## @@ -54,10 +54,15 @@ private HllCountMergePartialFn() {} return null; } + @Nullable @Override public HyperLogLogPlusPlus addInput( @Nullable HyperLogLogPlusPlus accumulator, byte[] input) { Review comment: @boyuanzz PTAL at the added logging This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 310084) Time Spent: 33.5h (was: 33h 20m) > A new count distinct transform based on BigQuery compatible HyperLogLog++ > implementation > > > Key: BEAM-7013 > URL: https://issues.apache.org/jira/browse/BEAM-7013 > Project: Beam > Issue Type: New Feature > Components: extensions-java-sketching, sdk-java-core >Reporter: Yueyang Qiu >Assignee: Yueyang Qiu >Priority: Major > Fix For: 2.16.0 > > Time Spent: 33.5h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=310083=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-310083 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 10/Sep/19 20:27 Start Date: 10/Sep/19 20:27 Worklog Time Spent: 10m Work Description: zfraa commented on pull request #9519: [BEAM-7013] Use a 0-length byte array to represent empty sketch in HllCount URL: https://github.com/apache/beam/pull/9519#discussion_r322946215 ## File path: sdks/java/extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCountMergePartialFn.java ## @@ -54,10 +54,15 @@ private HllCountMergePartialFn() {} return null; } + @Nullable @Override public HyperLogLogPlusPlus addInput( @Nullable HyperLogLogPlusPlus accumulator, byte[] input) { Review comment: LGTM! Maybe Boyuan can also have a quick look at the logging? Not familiar with what's usual in Beam. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 310083) Time Spent: 33h 20m (was: 33h 10m) > A new count distinct transform based on BigQuery compatible HyperLogLog++ > implementation > > > Key: BEAM-7013 > URL: https://issues.apache.org/jira/browse/BEAM-7013 > Project: Beam > Issue Type: New Feature > Components: extensions-java-sketching, sdk-java-core >Reporter: Yueyang Qiu >Assignee: Yueyang Qiu >Priority: Major > Fix For: 2.16.0 > > Time Spent: 33h 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=310032=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-310032 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 10/Sep/19 18:50 Start Date: 10/Sep/19 18:50 Worklog Time Spent: 10m Work Description: robinyqiu commented on pull request #9519: [BEAM-7013] Use a 0-length byte array to represent empty sketch in HllCount URL: https://github.com/apache/beam/pull/9519#discussion_r322906525 ## File path: sdks/java/extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCountMergePartialFn.java ## @@ -54,10 +54,15 @@ private HllCountMergePartialFn() {} return null; } + @Nullable @Override public HyperLogLogPlusPlus addInput( @Nullable HyperLogLogPlusPlus accumulator, byte[] input) { Review comment: > I would lean towards avoiding user errors, since every error avoided is something that users don't need to revise their pipeline over, and is an issue that is not escalated to us. Agreed. Actually, I figured out that we can accept nulls and log a warning suggesting that they be replaced with byte[0]. Made that change. PTAL. > Also, if users need to filter their input and replace nulls with byte[0], is that streamed (resp. folded into another pass over the data) or does it result in an extra pass over the data? That depends on their pipeline implementation. If users do that in the `PTransform` where the null is created (e.g. [here](https://github.com/robinyqiu/beam/blob/3b6a628c9ad0fbf63b7c1f7d355dbc8cf5219eb2/sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/BigQueryHllSketchCompatibilityIT.java#L144) in BigQueryIO), then it will not result in an extra pass. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 310032) Time Spent: 33h 10m (was: 33h) > A new count distinct transform based on BigQuery compatible HyperLogLog++ > implementation > > > Key: BEAM-7013 > URL: https://issues.apache.org/jira/browse/BEAM-7013 > Project: Beam > Issue Type: New Feature > Components: extensions-java-sketching, sdk-java-core >Reporter: Yueyang Qiu >Assignee: Yueyang Qiu >Priority: Major > Fix For: 2.16.0 > > Time Spent: 33h 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.2#803003)
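The behavior described in the comment above (accept a null input, log a warning, and treat it like an empty sketch) can be sketched as follows. This is not the code of `HllCountMergePartialFn`: to stay self-contained it uses `java.util.BitSet` as a stand-in for the HyperLogLog++ accumulator (union via bitwise OR instead of an HLL++ merge), `java.util.logging` instead of whatever logger Beam uses, and a hypothetical class name `NullTolerantMerge`. Only the shape of the null and empty handling is meant to carry over.

```java
import java.util.BitSet;
import java.util.logging.Logger;

// Illustrative stand-in only: BitSet plays the role of the sketch accumulator.
final class NullTolerantMerge {
  private static final Logger LOG = Logger.getLogger(NullTolerantMerge.class.getName());

  /**
   * Folds one serialized "sketch" into the accumulator.
   * A null accumulator means that nothing non-empty has been merged yet.
   */
  public static BitSet addInput(BitSet accumulator, byte[] input) {
    if (input == null) {
      // Tolerate nulls instead of throwing, but nudge users towards byte[0].
      LOG.warning("Received a null sketch and treated it as empty; "
          + "consider replacing nulls with byte[0] upstream.");
      return accumulator;
    }
    if (input.length == 0) {
      // A 0-length array is the empty sketch: merging it changes nothing.
      return accumulator;
    }
    if (accumulator == null) {
      return BitSet.valueOf(input);
    }
    accumulator.or(BitSet.valueOf(input));
    return accumulator;
  }

  public static void main(String[] args) {
    BitSet acc = null;
    acc = addInput(acc, null);                 // warning logged, still null
    acc = addInput(acc, new byte[0]);          // empty sketch, still null
    acc = addInput(acc, new byte[] {0b0101});  // bits 0 and 2
    acc = addInput(acc, new byte[] {0b0011});  // adds bit 1
    System.out.println(acc);                   // {0, 1, 2}
  }
}
```

Handling the null at the point where it is produced, as suggested in the comment, keeps this warning path from being hit at all and avoids any extra pass over the data.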
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=310025=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-310025 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 10/Sep/19 18:41 Start Date: 10/Sep/19 18:41 Worklog Time Spent: 10m Work Description: robinyqiu commented on pull request #9519: [BEAM-7013] Use a 0-length byte array to represent empty sketch in HllCount URL: https://github.com/apache/beam/pull/9519#discussion_r322902745 ## File path: sdks/java/extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCountMergePartialFn.java ## @@ -54,10 +54,16 @@ private HllCountMergePartialFn() {} return null; } + @Nullable @Override public HyperLogLogPlusPlus addInput( @Nullable HyperLogLogPlusPlus accumulator, byte[] input) { -if (accumulator == null) { +if (input == null) { + throw new NullPointerException( Review comment: Nice catch! (But this is outdated since I changed the behavior.) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 310025) Time Spent: 33h (was: 32h 50m) > A new count distinct transform based on BigQuery compatible HyperLogLog++ > implementation > > > Key: BEAM-7013 > URL: https://issues.apache.org/jira/browse/BEAM-7013 > Project: Beam > Issue Type: New Feature > Components: extensions-java-sketching, sdk-java-core >Reporter: Yueyang Qiu >Assignee: Yueyang Qiu >Priority: Major > Fix For: 2.16.0 > > Time Spent: 33h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=310024=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-310024 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 10/Sep/19 18:38 Start Date: 10/Sep/19 18:38 Worklog Time Spent: 10m Work Description: robinyqiu commented on pull request #9519: [BEAM-7013] Use a 0-length byte array to represent empty sketch in HllCount URL: https://github.com/apache/beam/pull/9519#discussion_r322901323 ## File path: sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/HllCountTest.java ## @@ -370,4 +464,14 @@ public void testExtractPerKey() { .containsInAnyOrder(Arrays.asList(KV.of("k", INTS1_ESTIMATE), KV.of("k", INTS2_ESTIMATE))); p.run(); } + + @Test + @Category(NeedsRunner.class) + public void testExtractPerKey_EmptySketch() { Review comment: Nice catch of the missing test! Done. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 310024) Time Spent: 32h 50m (was: 32h 40m) > A new count distinct transform based on BigQuery compatible HyperLogLog++ > implementation > > > Key: BEAM-7013 > URL: https://issues.apache.org/jira/browse/BEAM-7013 > Project: Beam > Issue Type: New Feature > Components: extensions-java-sketching, sdk-java-core >Reporter: Yueyang Qiu >Assignee: Yueyang Qiu >Priority: Major > Fix For: 2.16.0 > > Time Spent: 32h 50m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=310021=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-310021 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 10/Sep/19 18:35 Start Date: 10/Sep/19 18:35 Worklog Time Spent: 10m Work Description: robinyqiu commented on pull request #9519: [BEAM-7013] Use a 0-length byte array to represent empty sketch in HllCount URL: https://github.com/apache/beam/pull/9519#discussion_r322899925 ## File path: sdks/java/extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCount.java ## @@ -346,7 +354,14 @@ private Extract() {} @ProcessElement Review comment: Thanks for the suggestion! Done. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 310021) Time Spent: 32h 40m (was: 32.5h) > A new count distinct transform based on BigQuery compatible HyperLogLog++ > implementation > > > Key: BEAM-7013 > URL: https://issues.apache.org/jira/browse/BEAM-7013 > Project: Beam > Issue Type: New Feature > Components: extensions-java-sketching, sdk-java-core >Reporter: Yueyang Qiu >Assignee: Yueyang Qiu >Priority: Major > Fix For: 2.16.0 > > Time Spent: 32h 40m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=310015=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-310015 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 10/Sep/19 18:25 Start Date: 10/Sep/19 18:25 Worklog Time Spent: 10m Work Description: robinyqiu commented on pull request #9519: [BEAM-7013] Use a 0-length byte array to represent empty sketch in HllCount URL: https://github.com/apache/beam/pull/9519#discussion_r322895610 ## File path: sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/HllCountTest.java ## @@ -281,12 +283,12 @@ public void testMergePartialGlobally() { @Test @Category(NeedsRunner.class) - public void testMergePartialGlobally_MergeWithSketchForEmptySet() { + public void testMergePartialGlobally_EmptyInput() { Review comment: Nice catch! I was following the style in the [zetasketch library](https://github.com/google/zetasketch/blob/4aa44e9cb543d766318b919fc6bb359bdf7b0809/javatests/com/google/zetasketch/HyperLogLogPlusPlusTest.java#L97), which I think is Google style. I am not sure whether it complies with Beam style, so I have changed it. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 310015) Time Spent: 32.5h (was: 32h 20m) > A new count distinct transform based on BigQuery compatible HyperLogLog++ > implementation > > > Key: BEAM-7013 > URL: https://issues.apache.org/jira/browse/BEAM-7013 > Project: Beam > Issue Type: New Feature > Components: extensions-java-sketching, sdk-java-core >Reporter: Yueyang Qiu >Assignee: Yueyang Qiu >Priority: Major > Fix For: 2.16.0 > > Time Spent: 32.5h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=310002=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-310002 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 10/Sep/19 18:05 Start Date: 10/Sep/19 18:05 Worklog Time Spent: 10m Work Description: robinyqiu commented on pull request #9519: [BEAM-7013] Use a 0-length byte array to represent empty sketch in HllCount URL: https://github.com/apache/beam/pull/9519#discussion_r322886862 ## File path: sdks/java/extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCountInitFn.java ## @@ -57,6 +57,11 @@ void setPrecision(int precision) { this.precision = precision; } + @Override + public byte[] defaultValue() { +return new byte[0]; Review comment: Ah I see. You are comparing byte[0] representation with the proto representation. Done! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 310002) Time Spent: 32h 20m (was: 32h 10m) > A new count distinct transform based on BigQuery compatible HyperLogLog++ > implementation > > > Key: BEAM-7013 > URL: https://issues.apache.org/jira/browse/BEAM-7013 > Project: Beam > Issue Type: New Feature > Components: extensions-java-sketching, sdk-java-core >Reporter: Yueyang Qiu >Assignee: Yueyang Qiu >Priority: Major > Fix For: 2.16.0 > > Time Spent: 32h 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=309991=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-309991 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 10/Sep/19 17:43 Start Date: 10/Sep/19 17:43 Worklog Time Spent: 10m Work Description: robinyqiu commented on pull request #9519: [BEAM-7013] Use a 0-length byte array to represent empty sketch in HllCount URL: https://github.com/apache/beam/pull/9519#discussion_r322876783 ## File path: sdks/java/extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCount.java ## @@ -237,10 +240,11 @@ private Builder(HllCountInitFn initFn) { * PCollection} and returns a {@code PCollection} which consists of the HLL++ * sketch computed from the elements in the input {@code PCollection}. * - * Returns an empty output {@code PCollection} if the input {@code PCollection} is empty. + * Returns a singleton {@code PCollection} with an "empty sketch" (0-length byte array) if + * the input {@code PCollection} is empty. Review comment: > Or does a perKey aggregation work exactly like a group-by, i.e., it can never be associated with an empty aggregation? Yes. (Under the hood it uses a transform called `GroupByKey`.) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 309991) Time Spent: 32h 10m (was: 32h) > A new count distinct transform based on BigQuery compatible HyperLogLog++ > implementation > > > Key: BEAM-7013 > URL: https://issues.apache.org/jira/browse/BEAM-7013 > Project: Beam > Issue Type: New Feature > Components: extensions-java-sketching, sdk-java-core >Reporter: Yueyang Qiu >Assignee: Yueyang Qiu >Priority: Major > Fix For: 2.16.0 > > Time Spent: 32h 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.2#803003)
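The difference between the global and per-key cases discussed in this comment can be sketched in plain Java. This is a minimal illustration under stated assumptions, not the Beam implementation: the class `GloballyVsPerKeySketch`, its methods, and the toy `merge` are hypothetical stand-ins, and a sketch is modeled as a bare `byte[]`.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical stand-in; not the Beam or ZetaSketch API. */
final class GloballyVsPerKeySketch {

  /** Placeholder merge (keeps the longer array); a real sketch merge is more involved. */
  static byte[] merge(byte[] a, byte[] b) {
    return a.length >= b.length ? a : b;
  }

  /** Global case: always exactly one result; empty input yields the default byte[0]. */
  static byte[] combineGlobally(List<byte[]> inputs) {
    byte[] result = new byte[0];
    for (byte[] in : inputs) {
      result = merge(result, in);
    }
    return result;
  }

  /** Per-key case: behaves like a group-by, so a key only exists if it had input. */
  static Map<String, byte[]> combinePerKey(List<Map.Entry<String, byte[]>> inputs) {
    Map<String, byte[]> out = new TreeMap<>();
    for (Map.Entry<String, byte[]> kv : inputs) {
      out.merge(kv.getKey(), kv.getValue(), GloballyVsPerKeySketch::merge);
    }
    return out;
  }

  public static void main(String[] args) {
    // Empty input, global: one result, the zero-length "empty sketch".
    System.out.println(combineGlobally(Collections.<byte[]>emptyList()).length);

    // Empty input, per key: no keys, hence no empty aggregation to report.
    List<Map.Entry<String, byte[]>> kvs = new ArrayList<>();
    System.out.println(combinePerKey(kvs).isEmpty());

    kvs.add(new AbstractMap.SimpleEntry<>("k", new byte[] {1}));
    System.out.println(combinePerKey(kvs).keySet());
  }
}
```

Under these assumptions, only the global path needs a documented default value, which matches the javadoc change being limited to `globally()`.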
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=309921=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-309921 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 10/Sep/19 15:57 Start Date: 10/Sep/19 15:57 Worklog Time Spent: 10m Work Description: zfraa commented on pull request #9519: [BEAM-7013] Use a 0-length byte array to represent empty sketch in HllCount URL: https://github.com/apache/beam/pull/9519#discussion_r322824226 ## File path: sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/HllCountTest.java ## @@ -370,4 +464,14 @@ public void testExtractPerKey() { .containsInAnyOrder(Arrays.asList(KV.of("k", INTS1_ESTIMATE), KV.of("k", INTS2_ESTIMATE))); p.run(); } + + @Test + @Category(NeedsRunner.class) + public void testExtractPerKey_EmptySketch() { Review comment: Since the Extract implementations for Global and PerKey are different, would test both types of empty sketches for both cases. Otherwise, very nice test coverage! :) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 309921) Time Spent: 31h 50m (was: 31h 40m) > A new count distinct transform based on BigQuery compatible HyperLogLog++ > implementation > > > Key: BEAM-7013 > URL: https://issues.apache.org/jira/browse/BEAM-7013 > Project: Beam > Issue Type: New Feature > Components: extensions-java-sketching, sdk-java-core >Reporter: Yueyang Qiu >Assignee: Yueyang Qiu >Priority: Major > Fix For: 2.16.0 > > Time Spent: 31h 50m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=309922=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-309922 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 10/Sep/19 15:57 Start Date: 10/Sep/19 15:57 Worklog Time Spent: 10m Work Description: zfraa commented on pull request #9519: [BEAM-7013] Use a 0-length byte array to represent empty sketch in HllCount URL: https://github.com/apache/beam/pull/9519#discussion_r322657271 ## File path: sdks/java/extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCountMergePartialFn.java ## @@ -54,10 +54,15 @@ private HllCountMergePartialFn() {} return null; } + @Nullable @Override public HyperLogLogPlusPlus addInput( @Nullable HyperLogLogPlusPlus accumulator, byte[] input) { Review comment: Let's leave things as they are in this PR, so that this can be merged! Whether we should do any follow-ups: I would lean towards avoiding user errors, since every error avoided is something that users don't need to revise their pipeline over, and is an issue that is not escalated to us. Also, if users need to filter their input and replace nulls with byte[0], is that streamed (resp. folded into another pass over the data) or does it result in an extra-pass over the data? But I don't feel strongly about it and I'll let you make the call, and as you say, we can always change it later if there's a need for it. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 309922) > A new count distinct transform based on BigQuery compatible HyperLogLog++ > implementation > > > Key: BEAM-7013 > URL: https://issues.apache.org/jira/browse/BEAM-7013 > Project: Beam > Issue Type: New Feature > Components: extensions-java-sketching, sdk-java-core >Reporter: Yueyang Qiu >Assignee: Yueyang Qiu >Priority: Major > Fix For: 2.16.0 > > Time Spent: 31h 50m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=309919=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-309919 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 10/Sep/19 15:57 Start Date: 10/Sep/19 15:57 Worklog Time Spent: 10m Work Description: zfraa commented on pull request #9519: [BEAM-7013] Use a 0-length byte array to represent empty sketch in HllCount URL: https://github.com/apache/beam/pull/9519#discussion_r322654460 ## File path: sdks/java/extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCountInitFn.java ## @@ -57,6 +57,11 @@ void setPrecision(int precision) { this.precision = precision; } + @Override + public byte[] defaultValue() { +return new byte[0]; Review comment: Nits: s/Result/The result/, and I would add "because we cannot..., and because it's more compact". Otherwise looks great! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 309919) Time Spent: 31h 50m (was: 31h 40m) > A new count distinct transform based on BigQuery compatible HyperLogLog++ > implementation > > > Key: BEAM-7013 > URL: https://issues.apache.org/jira/browse/BEAM-7013 > Project: Beam > Issue Type: New Feature > Components: extensions-java-sketching, sdk-java-core >Reporter: Yueyang Qiu >Assignee: Yueyang Qiu >Priority: Major > Fix For: 2.16.0 > > Time Spent: 31h 50m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=309920=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-309920 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 10/Sep/19 15:57 Start Date: 10/Sep/19 15:57 Worklog Time Spent: 10m Work Description: zfraa commented on pull request #9519: [BEAM-7013] Use a 0-length byte array to represent empty sketch in HllCount URL: https://github.com/apache/beam/pull/9519#discussion_r322760086 ## File path: sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/HllCountTest.java ## @@ -302,12 +304,53 @@ public void testMergePartialGlobally_SingletonInput() { @Test @Category(NeedsRunner.class) - public void testMergePartialGlobally_EmptyInput() { + public void testMergePartialGlobally_SingletonInputEmptySketch() { PCollection result = -p.apply(Create.empty(TypeDescriptor.of(byte[].class))) + p.apply(Create.of(EMPTY_SKETCH)).apply(HllCount.MergePartial.globally()); + +PAssert.thatSingleton(result).isEqualTo(EMPTY_SKETCH); +p.run(); + } + + @Test + @Category(NeedsRunner.class) + public void testMergePartialGlobally_MergeWithEmptySketch() { +PCollection result = +p.apply(Create.of(LONGS_SKETCH, EMPTY_SKETCH)).apply(HllCount.MergePartial.globally()); + +PAssert.thatSingleton(result).isEqualTo(LONGS_SKETCH); +p.run(); + } + + @Test + @Category(NeedsRunner.class) + public void testMergePartialGlobally_MergeMultipleEmptySketches() { +PCollection result = +p.apply(Create.of(EMPTY_SKETCH, EMPTY_SKETCH)).apply(HllCount.MergePartial.globally()); + +PAssert.thatSingleton(result).isEqualTo(EMPTY_SKETCH); +p.run(); + } + + @Test + @Category(NeedsRunner.class) + public void testMergePartialGlobally_MergeWithSketchOfEmptySet() { +PCollection result = +p.apply(Create.of(LONGS_SKETCH, LONGS_SKETCH_OF_EMPTY_SET)) +.apply(HllCount.MergePartial.globally()); + +PAssert.thatSingleton(result).isEqualTo(LONGS_SKETCH); +p.run(); + } + + @Test + 
@Category(NeedsRunner.class) + public void testMergePartialGlobally_MergeEmptySketchWithSketchOfEmptySet() { +PCollection result = +p.apply(Create.of(EMPTY_SKETCH, LONGS_SKETCH_OF_EMPTY_SET)) .apply(HllCount.MergePartial.globally()); -PAssert.that(result).empty(); +PAssert.thatSingleton(result).isEqualTo(LONGS_SKETCH_OF_EMPTY_SET); Review comment: Nit: this is testing an implementation detail (there are other valid return values here). Maybe apply an extract and verify that the result is zero, because there, there's only one valid answer? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 309920) Time Spent: 31h 50m (was: 31h 40m) > A new count distinct transform based on BigQuery compatible HyperLogLog++ > implementation > > > Key: BEAM-7013 > URL: https://issues.apache.org/jira/browse/BEAM-7013 > Project: Beam > Issue Type: New Feature > Components: extensions-java-sketching, sdk-java-core >Reporter: Yueyang Qiu >Assignee: Yueyang Qiu >Priority: Major > Fix For: 2.16.0 > > Time Spent: 31h 50m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.2#803003)
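The point of this review comment, that a test should assert on the extracted count rather than on the merged bytes, can be illustrated with a small self-contained sketch. Everything here is an assumption made for illustration: the one-byte header encoding, the class `ExtractAssertionSketch`, and the `extract` helper are hypothetical and are not the ZetaSketch wire format or the `HllCount` API.

```java
import java.util.Arrays;

/** Hypothetical toy encoding; not the ZetaSketch wire format. */
final class ExtractAssertionSketch {

  /** The zero-length "empty sketch" introduced by the PR. */
  static final byte[] EMPTY_SKETCH = new byte[0];

  /** Toy "sketch of the empty set": a one-byte header followed by no values. */
  static final byte[] SKETCH_OF_EMPTY_SET = {0x01};

  /** Toy extract: a zero-length array counts as 0, otherwise count the bytes after the header. */
  static long extract(byte[] sketch) {
    return sketch.length == 0 ? 0L : sketch.length - 1;
  }

  public static void main(String[] args) {
    // Two different byte representations of "nothing was counted" ...
    System.out.println(Arrays.equals(EMPTY_SKETCH, SKETCH_OF_EMPTY_SET)); // false

    // ... but only one valid extracted value, so this is the stable thing to assert on.
    System.out.println(extract(EMPTY_SKETCH)); // 0
    System.out.println(extract(SKETCH_OF_EMPTY_SET)); // 0
  }
}
```

A merge of the two inputs could legitimately return either representation, which is why an assertion on the bytes pins down an implementation detail while an assertion on the extracted count does not.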
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=309924=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-309924 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 10/Sep/19 15:57 Start Date: 10/Sep/19 15:57 Worklog Time Spent: 10m Work Description: zfraa commented on pull request #9519: [BEAM-7013] Use a 0-length byte array to represent empty sketch in HllCount URL: https://github.com/apache/beam/pull/9519#discussion_r322661239 ## File path: sdks/java/extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCount.java ## @@ -237,10 +240,11 @@ private Builder(HllCountInitFn initFn) { * PCollection} and returns a {@code PCollection} which consists of the HLL++ * sketch computed from the elements in the input {@code PCollection}. * - * Returns an empty output {@code PCollection} if the input {@code PCollection} is empty. + * Returns a singleton {@code PCollection} with an "empty sketch" (0-length byte array) if + * the input {@code PCollection} is empty. Review comment: Could you add the same comment to the perKey method below? Or does a perKey aggregation work exactly like a group-by, i.e., it can never be associated with an empty aggregation? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 309924) Time Spent: 32h (was: 31h 50m) > A new count distinct transform based on BigQuery compatible HyperLogLog++ > implementation > > > Key: BEAM-7013 > URL: https://issues.apache.org/jira/browse/BEAM-7013 > Project: Beam > Issue Type: New Feature > Components: extensions-java-sketching, sdk-java-core >Reporter: Yueyang Qiu >Assignee: Yueyang Qiu >Priority: Major > Fix For: 2.16.0 > > Time Spent: 32h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=309923=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-309923 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 10/Sep/19 15:57 Start Date: 10/Sep/19 15:57 Worklog Time Spent: 10m Work Description: zfraa commented on pull request #9519: [BEAM-7013] Use a 0-length byte array to represent empty sketch in HllCount URL: https://github.com/apache/beam/pull/9519#discussion_r322656124 ## File path: sdks/java/extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCountMergePartialFn.java ## @@ -54,10 +54,15 @@ private HllCountMergePartialFn() {} return null; } + @Nullable Review comment: Right, same confusion on my side as below (byte[] empty sketch vs. HyperLogLogPlusPlus empty sketch). LG! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 309923) > A new count distinct transform based on BigQuery compatible HyperLogLog++ > implementation > > > Key: BEAM-7013 > URL: https://issues.apache.org/jira/browse/BEAM-7013 > Project: Beam > Issue Type: New Feature > Components: extensions-java-sketching, sdk-java-core >Reporter: Yueyang Qiu >Assignee: Yueyang Qiu >Priority: Major > Fix For: 2.16.0 > > Time Spent: 31h 50m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=309918=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-309918 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 10/Sep/19 15:57 Start Date: 10/Sep/19 15:57 Worklog Time Spent: 10m Work Description: zfraa commented on pull request #9519: [BEAM-7013] Use a 0-length byte array to represent empty sketch in HllCount URL: https://github.com/apache/beam/pull/9519#discussion_r322662678 ## File path: sdks/java/extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCount.java ## @@ -346,7 +354,14 @@ private Extract() {} @ProcessElement Review comment: (Comment belongs further up, but I can't comment on the unchanged lines:) Would mention the corner case of empty aggregations somewhere in the 'Extract' javadoc as well. E.g., "When extracting from an empty aggregation (i.e., a byte array of length 0), the result returned is 0." This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 309918) Time Spent: 31h 40m (was: 31.5h) > A new count distinct transform based on BigQuery compatible HyperLogLog++ > implementation > > > Key: BEAM-7013 > URL: https://issues.apache.org/jira/browse/BEAM-7013 > Project: Beam > Issue Type: New Feature > Components: extensions-java-sketching, sdk-java-core >Reporter: Yueyang Qiu >Assignee: Yueyang Qiu >Priority: Major > Fix For: 2.16.0 > > Time Spent: 31h 40m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=309667=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-309667 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 10/Sep/19 09:47 Start Date: 10/Sep/19 09:47 Worklog Time Spent: 10m Work Description: zfraa commented on pull request #9519: [BEAM-7013] Use a 0-length byte array to represent empty sketch in HllCount URL: https://github.com/apache/beam/pull/9519#discussion_r322649846 ## File path: sdks/java/extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCountMergePartialFn.java ## @@ -94,9 +99,9 @@ private HllCountMergePartialFn() {} @Override public byte[] extractOutput(@Nullable HyperLogLogPlusPlus accumulator) { if (accumulator == null) { - throw new IllegalStateException( - "HllCountMergePartialFn.extractOutput() should not be called on a null accumulator."); + return new byte[0]; Review comment: Got it, thanks! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 309667) Time Spent: 31.5h (was: 31h 20m) > A new count distinct transform based on BigQuery compatible HyperLogLog++ > implementation > > > Key: BEAM-7013 > URL: https://issues.apache.org/jira/browse/BEAM-7013 > Project: Beam > Issue Type: New Feature > Components: extensions-java-sketching, sdk-java-core >Reporter: Yueyang Qiu >Assignee: Yueyang Qiu >Priority: Major > Fix For: 2.16.0 > > Time Spent: 31.5h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=309660=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-309660 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 10/Sep/19 09:31 Start Date: 10/Sep/19 09:31 Worklog Time Spent: 10m Work Description: zfraa commented on pull request #9519: [BEAM-7013] Use a 0-length byte array to represent empty sketch in HllCount URL: https://github.com/apache/beam/pull/9519#discussion_r322642519 ## File path: sdks/java/extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCountMergePartialFn.java ## @@ -54,10 +54,15 @@ private HllCountMergePartialFn() {} return null; } + @Nullable @Override public HyperLogLogPlusPlus addInput( @Nullable HyperLogLogPlusPlus accumulator, byte[] input) { -if (accumulator == null) { +if (input == null) { + throw new NullPointerException("Null is not a valid sketch."); +} else if (input.length == 0) { + return accumulator; +} else if (accumulator == null) { Review comment: Right, sorry for missing that :) (Can be resolved). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 309660) Time Spent: 31h 20m (was: 31h 10m) > A new count distinct transform based on BigQuery compatible HyperLogLog++ > implementation > > > Key: BEAM-7013 > URL: https://issues.apache.org/jira/browse/BEAM-7013 > Project: Beam > Issue Type: New Feature > Components: extensions-java-sketching, sdk-java-core >Reporter: Yueyang Qiu >Assignee: Yueyang Qiu >Priority: Major > Fix For: 2.16.0 > > Time Spent: 31h 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.2#803003)
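The branch order in the `addInput` diff above, together with the `extractOutput` change that returns `new byte[0]` for a null accumulator, can be traced with a toy model. This is a sketch under stated assumptions: `MergePartialSketch` and its `Acc` class are hypothetical, and the accumulator is an exact set of byte values standing in for `HyperLogLogPlusPlus`; only the control flow follows the diff.

```java
import java.util.Arrays;
import java.util.TreeSet;

/** Hypothetical stand-in for the combine function; not the Beam or ZetaSketch API. */
final class MergePartialSketch {

  /** Toy accumulator: an exact set of byte values instead of a real HLL++ sketch. */
  static final class Acc {
    final TreeSet<Byte> values = new TreeSet<>();
  }

  /** Same branch order as the diff: null input, empty input, null accumulator, merge. */
  static Acc addInput(Acc accumulator, byte[] input) {
    if (input == null) {
      throw new NullPointerException("Null is not a valid sketch.");
    } else if (input.length == 0) {
      return accumulator; // empty sketch: nothing to merge, accumulator may stay null
    } else if (accumulator == null) {
      accumulator = new Acc(); // first non-empty sketch creates the accumulator
    }
    for (byte b : input) {
      accumulator.values.add(b);
    }
    return accumulator;
  }

  /** A null accumulator means no non-empty sketch was seen, so emit the empty sketch. */
  static byte[] extractOutput(Acc accumulator) {
    if (accumulator == null) {
      return new byte[0];
    }
    byte[] out = new byte[accumulator.values.size()];
    int i = 0;
    for (byte b : accumulator.values) {
      out[i++] = b;
    }
    return out;
  }

  public static void main(String[] args) {
    Acc acc = addInput(null, new byte[0]);
    System.out.println(extractOutput(acc).length); // 0

    acc = addInput(acc, new byte[] {3, 1, 3});
    acc = addInput(acc, new byte[0]); // merging an empty sketch changes nothing
    System.out.println(Arrays.toString(extractOutput(acc))); // [1, 3]
  }
}
```

In this model a null accumulator and a zero-length output are two views of the same state, which is one way to read why the PR no longer throws from `extractOutput`.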
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=309407=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-309407 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 10/Sep/19 00:31 Start Date: 10/Sep/19 00:31 Worklog Time Spent: 10m Work Description: boyuanzz commented on pull request #9519: [BEAM-7013] Use a 0-length byte array to represent empty sketch in HllCount URL: https://github.com/apache/beam/pull/9519#discussion_r322504225 ## File path: sdks/java/extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCountMergePartialFn.java ## @@ -54,10 +54,16 @@ private HllCountMergePartialFn() {} return null; } + @Nullable @Override public HyperLogLogPlusPlus addInput( @Nullable HyperLogLogPlusPlus accumulator, byte[] input) { -if (accumulator == null) { +if (input == null) { + throw new NullPointerException( Review comment: Consider using `checkArgument`? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 309407) Time Spent: 31h (was: 30h 50m) > A new count distinct transform based on BigQuery compatible HyperLogLog++ > implementation > > > Key: BEAM-7013 > URL: https://issues.apache.org/jira/browse/BEAM-7013 > Project: Beam > Issue Type: New Feature > Components: extensions-java-sketching, sdk-java-core >Reporter: Yueyang Qiu >Assignee: Yueyang Qiu >Priority: Major > Fix For: 2.16.0 > > Time Spent: 31h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=309408=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-309408 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 10/Sep/19 00:31 Start Date: 10/Sep/19 00:31 Worklog Time Spent: 10m Work Description: boyuanzz commented on pull request #9519: [BEAM-7013] Use a 0-length byte array to represent empty sketch in HllCount URL: https://github.com/apache/beam/pull/9519#discussion_r322505023 ## File path: sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/HllCountTest.java ## @@ -281,12 +283,12 @@ public void testMergePartialGlobally() { @Test @Category(NeedsRunner.class) - public void testMergePartialGlobally_MergeWithSketchForEmptySet() { + public void testMergePartialGlobally_EmptyInput() { Review comment: Hmm, I don't think an underscore `_` is the naming convention for anything other than constants. Maybe consider "With" instead? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 309408) Time Spent: 31h 10m (was: 31h) > A new count distinct transform based on BigQuery compatible HyperLogLog++ > implementation > > > Key: BEAM-7013 > URL: https://issues.apache.org/jira/browse/BEAM-7013 > Project: Beam > Issue Type: New Feature > Components: extensions-java-sketching, sdk-java-core >Reporter: Yueyang Qiu >Assignee: Yueyang Qiu >Priority: Major > Fix For: 2.16.0 > > Time Spent: 31h 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=309265=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-309265 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 09/Sep/19 21:30 Start Date: 09/Sep/19 21:30 Worklog Time Spent: 10m Work Description: robinyqiu commented on issue #9519: [BEAM-7013] Use a 0-length byte array to represent empty sketch in HllCount URL: https://github.com/apache/beam/pull/9519#issuecomment-529675162 Hi @zfraa , I think the test cases you mentioned above are already covered! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 309265) Time Spent: 30h 50m (was: 30h 40m) > A new count distinct transform based on BigQuery compatible HyperLogLog++ > implementation > > > Key: BEAM-7013 > URL: https://issues.apache.org/jira/browse/BEAM-7013 > Project: Beam > Issue Type: New Feature > Components: extensions-java-sketching, sdk-java-core >Reporter: Yueyang Qiu >Assignee: Yueyang Qiu >Priority: Major > Fix For: 2.16.0 > > Time Spent: 30h 50m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=309259=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-309259 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 09/Sep/19 21:19 Start Date: 09/Sep/19 21:19 Worklog Time Spent: 10m Work Description: robinyqiu commented on pull request #9519: [BEAM-7013] Use a 0-length byte array to represent empty sketch in HllCount URL: https://github.com/apache/beam/pull/9519#discussion_r322440348 ## File path: sdks/java/extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCountMergePartialFn.java ## @@ -54,10 +54,15 @@ private HllCountMergePartialFn() {} return null; } + @Nullable @Override public HyperLogLogPlusPlus addInput( @Nullable HyperLogLogPlusPlus accumulator, byte[] input) { Review comment: The `@Nullable` annotation is on the `accumulator` parameter, which can be null, and we handle that case properly without throwing an exception. > tl;dr: why not handle nulls instead of throwing? There are pros and cons of supporting nulls: * The pro is that we can save the users from exceptions, as you mentioned * The cons are 1) we would then have two different representations for "empty sketch"; and 2) I feel that if we accept nulls as input, we are encouraging users to produce nullable output (and use `NullableCoder`) from their upstream transforms, which is more costly in terms of encoding/decoding and more error-prone. Currently I slightly prefer not accepting nulls as "empty sketches". What is your opinion? One advantage of keeping the implementation as it is: we can always change it later to support nulls and it will be backwards compatible (but we cannot go the other way). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 309259) Time Spent: 30h 40m (was: 30.5h) > A new count distinct transform based on BigQuery compatible HyperLogLog++ > implementation > > > Key: BEAM-7013 > URL: https://issues.apache.org/jira/browse/BEAM-7013 > Project: Beam > Issue Type: New Feature > Components: extensions-java-sketching, sdk-java-core >Reporter: Yueyang Qiu >Assignee: Yueyang Qiu >Priority: Major > Fix For: 2.16.0 > > Time Spent: 30h 40m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=309253=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-309253 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 09/Sep/19 21:19 Start Date: 09/Sep/19 21:19 Worklog Time Spent: 10m Work Description: robinyqiu commented on pull request #9519: [BEAM-7013] Use a 0-length byte array to represent empty sketch in HllCount URL: https://github.com/apache/beam/pull/9519#discussion_r322341428 ## File path: sdks/java/extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCountMergePartialFn.java ## @@ -54,10 +54,15 @@ private HllCountMergePartialFn() {} return null; } + @Nullable @Override public HyperLogLogPlusPlus addInput( @Nullable HyperLogLogPlusPlus accumulator, byte[] input) { -if (accumulator == null) { +if (input == null) { + throw new NullPointerException("Null is not a valid sketch."); +} else if (input.length == 0) { + return accumulator; +} else if (accumulator == null) { Review comment: The `accumulator` is of type `HyperLogLogPlusPlus`, not `byte[]`, so I don't think we can do that here. (And we cannot use `byte[]` for the accumulator because of the cost of serialization/deserialization.) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 309253) Time Spent: 30h 20m (was: 30h 10m) > A new count distinct transform based on BigQuery compatible HyperLogLog++ > implementation > > > Key: BEAM-7013 > URL: https://issues.apache.org/jira/browse/BEAM-7013 > Project: Beam > Issue Type: New Feature > Components: extensions-java-sketching, sdk-java-core >Reporter: Yueyang Qiu >Assignee: Yueyang Qiu >Priority: Major > Fix For: 2.16.0 > > Time Spent: 30h 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.2#803003)
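The `addInput` branches quoted in the diff above can be tried out without Beam or ZetaSketch on the classpath. The following is only a sketch under stated assumptions: `FakeSketch` is a hypothetical stand-in that stores an exact set of longs (it is not `HyperLogLogPlusPlus`), and `encode` is an invented helper for building test inputs. Only the branch order (null input rejected, zero-length input ignored, null accumulator replaced by the first real sketch) is taken from the diff.

```java
import java.nio.ByteBuffer;
import java.util.TreeSet;

final class AddInputSketch {
  /** Hypothetical stand-in for HyperLogLogPlusPlus: an exact set of longs, not a real sketch. */
  static final class FakeSketch {
    final TreeSet<Long> values = new TreeSet<>();

    static FakeSketch fromBytes(byte[] bytes) {
      FakeSketch sketch = new FakeSketch();
      ByteBuffer buffer = ByteBuffer.wrap(bytes);
      while (buffer.remaining() >= Long.BYTES) {
        sketch.values.add(buffer.getLong());
      }
      return sketch;
    }

    void merge(FakeSketch other) {
      values.addAll(other.values);
    }
  }

  /** Invented helper: packs longs into the byte layout FakeSketch.fromBytes expects. */
  static byte[] encode(long... values) {
    ByteBuffer buffer = ByteBuffer.allocate(values.length * Long.BYTES);
    for (long value : values) {
      buffer.putLong(value);
    }
    return buffer.array();
  }

  /** Same branch order as the addInput diff; the accumulator may be null ("nothing seen yet"). */
  static FakeSketch addInput(FakeSketch accumulator, byte[] input) {
    if (input == null) {
      throw new NullPointerException("Null is not a valid sketch.");
    } else if (input.length == 0) {
      return accumulator; // empty sketch: nothing to merge, accumulator unchanged
    }
    FakeSketch incoming = FakeSketch.fromBytes(input);
    if (accumulator == null) {
      return incoming; // first non-empty sketch becomes the accumulator
    }
    accumulator.merge(incoming);
    return accumulator;
  }
}
```

In this model, a run of zero-length inputs leaves the accumulator `null`, which is the state that `extractOutput` in the same PR maps back to a zero-length byte array.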
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=309257=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-309257 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 09/Sep/19 21:19 Start Date: 09/Sep/19 21:19 Worklog Time Spent: 10m Work Description: robinyqiu commented on pull request #9519: [BEAM-7013] Use a 0-length byte array to represent empty sketch in HllCount URL: https://github.com/apache/beam/pull/9519#discussion_r322397926 ## File path: sdks/java/extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCountMergePartialFn.java ## @@ -54,10 +54,15 @@ private HllCountMergePartialFn() {} return null; } + @Nullable Review comment: The return value of this function is an accumulator of type `HyperLogLogPlusPlus` and it can be `null`. This annotation was missing from the last PR so I am adding it back here. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 309257) Time Spent: 30.5h (was: 30h 20m) > A new count distinct transform based on BigQuery compatible HyperLogLog++ > implementation > > > Key: BEAM-7013 > URL: https://issues.apache.org/jira/browse/BEAM-7013 > Project: Beam > Issue Type: New Feature > Components: extensions-java-sketching, sdk-java-core >Reporter: Yueyang Qiu >Assignee: Yueyang Qiu >Priority: Major > Fix For: 2.16.0 > > Time Spent: 30.5h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=309254=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-309254 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 09/Sep/19 21:19 Start Date: 09/Sep/19 21:19 Worklog Time Spent: 10m Work Description: robinyqiu commented on pull request #9519: [BEAM-7013] Use a 0-length byte array to represent empty sketch in HllCount URL: https://github.com/apache/beam/pull/9519#discussion_r322342971 ## File path: sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/HllCountTest.java ## @@ -86,13 +88,6 @@ LONGS_SKETCH = hll.serializeToByteArray(); } - private static final byte[] LONGS_EMPTY_SKETCH; - - static { -HyperLogLogPlusPlus hll = new HyperLogLogPlusPlus.Builder().buildForLongs(); -LONGS_EMPTY_SKETCH = hll.serializeToByteArray(); - } - Review comment: The current implementation can handle this case. I agree that keeping tests for this case is valuable, so I have added it back and added a couple more. :) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 309254) Time Spent: 30h 20m (was: 30h 10m) > A new count distinct transform based on BigQuery compatible HyperLogLog++ > implementation > > > Key: BEAM-7013 > URL: https://issues.apache.org/jira/browse/BEAM-7013 > Project: Beam > Issue Type: New Feature > Components: extensions-java-sketching, sdk-java-core >Reporter: Yueyang Qiu >Assignee: Yueyang Qiu >Priority: Major > Fix For: 2.16.0 > > Time Spent: 30h 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=309256=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-309256 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 09/Sep/19 21:19 Start Date: 09/Sep/19 21:19 Worklog Time Spent: 10m Work Description: robinyqiu commented on pull request #9519: [BEAM-7013] Use a 0-length byte array to represent empty sketch in HllCount URL: https://github.com/apache/beam/pull/9519#discussion_r322400143 ## File path: sdks/java/extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCountMergePartialFn.java ## @@ -94,9 +99,9 @@ private HllCountMergePartialFn() {} @Override public byte[] extractOutput(@Nullable HyperLogLogPlusPlus accumulator) { if (accumulator == null) { - throw new IllegalStateException( - "HllCountMergePartialFn.extractOutput() should not be called on a null accumulator."); + return new byte[0]; Review comment: The reason is that the superclass `CombineFn` has a default [implementation](https://github.com/robinyqiu/beam/blob/3b6a628c9ad0fbf63b7c1f7d355dbc8cf5219eb2/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/Combine.java#L455-L458): ``` @Override public OutputT defaultValue() { return extractOutput(createAccumulator()); } ``` For `MergePartial`, this implementation already gives the correct result: `createAccumulator()` returns `null` and `extractOutput(null)` returns `new byte[0]`. But for `Init`, the output of this implementation will be a valid empty sketch for each type (e.g. `LONGS_PROTO_OF_EMPTY_SKETCH` as you mentioned in the next comment), which is not what we want, so we need to override it there. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 309256) Time Spent: 30.5h (was: 30h 20m) > A new count distinct transform based on BigQuery compatible HyperLogLog++ > implementation > > > Key: BEAM-7013 > URL: https://issues.apache.org/jira/browse/BEAM-7013 > Project: Beam > Issue Type: New Feature > Components: extensions-java-sketching, sdk-java-core >Reporter: Yueyang Qiu >Assignee: Yueyang Qiu >Priority: Major > Fix For: 2.16.0 > > Time Spent: 30.5h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.2#803003)
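The `defaultValue()` reasoning in the comment above can be modelled in a few lines of plain Java. This is a sketch under stated assumptions, not Beam code: `MiniCombineFn` is a hypothetical miniature of `CombineFn` that keeps only the three methods under discussion, and `FakeSketch` is an invented stand-in whose serialized form is never zero-length, mimicking a typed empty sketch.

```java
final class DefaultValueSketch {
  /** Invented stand-in for a typed sketch: serializes to non-empty bytes even with no values. */
  static final class FakeSketch {
    byte[] serializeToByteArray() {
      return new byte[] {0x01}; // pretend type header; always present
    }
  }

  /** Hypothetical miniature of CombineFn; only the methods discussed above are modelled. */
  abstract static class MiniCombineFn {
    abstract FakeSketch createAccumulator();

    abstract byte[] extractOutput(FakeSketch accumulator);

    // Same shape as the default implementation quoted in the comment.
    byte[] defaultValue() {
      return extractOutput(createAccumulator());
    }
  }

  /** MergePartial-like: a null accumulator means "no sketch seen yet". */
  static final class MergePartialLike extends MiniCombineFn {
    @Override
    FakeSketch createAccumulator() {
      return null;
    }

    @Override
    byte[] extractOutput(FakeSketch accumulator) {
      return accumulator == null ? new byte[0] : accumulator.serializeToByteArray();
    }
  }

  /** Init-like without an override: the inherited default yields a typed empty sketch. */
  static class InitLikeNoOverride extends MiniCombineFn {
    @Override
    FakeSketch createAccumulator() {
      return new FakeSketch();
    }

    @Override
    byte[] extractOutput(FakeSketch accumulator) {
      return accumulator.serializeToByteArray();
    }
  }

  /** Init-like with the override from the diff: empty input maps to a zero-length array. */
  static final class InitLike extends InitLikeNoOverride {
    @Override
    byte[] defaultValue() {
      return new byte[0];
    }
  }
}
```

In this model `MergePartialLike` needs no override because its accumulator can already express "no input", while `InitLikeNoOverride` shows why the Init side has to override `defaultValue()` explicitly.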
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=309258=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-309258 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 09/Sep/19 21:19 Start Date: 09/Sep/19 21:19 Worklog Time Spent: 10m Work Description: robinyqiu commented on pull request #9519: [BEAM-7013] Use a 0-length byte array to represent empty sketch in HllCount URL: https://github.com/apache/beam/pull/9519#discussion_r322459772 ## File path: sdks/java/extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCountInitFn.java ## @@ -57,6 +57,11 @@ void setPrecision(int precision) { this.precision = precision; } + @Override + public byte[] defaultValue() { +return new byte[0]; Review comment: Thanks for the suggestion. Done! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 309258) Time Spent: 30h 40m (was: 30.5h) > A new count distinct transform based on BigQuery compatible HyperLogLog++ > implementation > > > Key: BEAM-7013 > URL: https://issues.apache.org/jira/browse/BEAM-7013 > Project: Beam > Issue Type: New Feature > Components: extensions-java-sketching, sdk-java-core >Reporter: Yueyang Qiu >Assignee: Yueyang Qiu >Priority: Major > Fix For: 2.16.0 > > Time Spent: 30h 40m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=309255=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-309255 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 09/Sep/19 21:19 Start Date: 09/Sep/19 21:19 Worklog Time Spent: 10m Work Description: robinyqiu commented on pull request #9519: [BEAM-7013] Use a 0-length byte array to represent empty sketch in HllCount URL: https://github.com/apache/beam/pull/9519#discussion_r322446265 ## File path: sdks/java/extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCount.java ## @@ -346,7 +348,13 @@ private Extract() {} @ProcessElement public void processElement( @Element byte[] sketch, OutputReceiver receiver) { - receiver.output(HyperLogLogPlusPlus.forProto(sketch).result()); +if (sketch == null) { + throw new NullPointerException("Null is not a valid sketch."); Review comment: This error message is much clearer! Done! (Replied to your comment about supporting null in the next thread.) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 309255) Time Spent: 30.5h (was: 30h 20m) > A new count distinct transform based on BigQuery compatible HyperLogLog++ > implementation > > > Key: BEAM-7013 > URL: https://issues.apache.org/jira/browse/BEAM-7013 > Project: Beam > Issue Type: New Feature > Components: extensions-java-sketching, sdk-java-core >Reporter: Yueyang Qiu >Assignee: Yueyang Qiu >Priority: Major > Fix For: 2.16.0 > > Time Spent: 30.5h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=309147=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-309147 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 09/Sep/19 18:43 Start Date: 09/Sep/19 18:43 Worklog Time Spent: 10m Work Description: robinyqiu commented on pull request #9447: [WIP] [BEAM-7013] Handle null input in HllCount.MergePartial and HllCount.Extract URL: https://github.com/apache/beam/pull/9447 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 309147) Time Spent: 30h 10m (was: 30h) > A new count distinct transform based on BigQuery compatible HyperLogLog++ > implementation > > > Key: BEAM-7013 > URL: https://issues.apache.org/jira/browse/BEAM-7013 > Project: Beam > Issue Type: New Feature > Components: extensions-java-sketching, sdk-java-core >Reporter: Yueyang Qiu >Assignee: Yueyang Qiu >Priority: Major > Fix For: 2.16.0 > > Time Spent: 30h 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=309036=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-309036 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 09/Sep/19 16:22 Start Date: 09/Sep/19 16:22 Worklog Time Spent: 10m Work Description: zfraa commented on pull request #9519: [BEAM-7013] Use a 0-length byte array to represent empty sketch in HllCount URL: https://github.com/apache/beam/pull/9519#discussion_r322251413 ## File path: sdks/java/extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCount.java ## @@ -346,7 +348,13 @@ private Extract() {} @ProcessElement public void processElement( @Element byte[] sketch, OutputReceiver receiver) { - receiver.output(HyperLogLogPlusPlus.forProto(sketch).result()); +if (sketch == null) { + throw new NullPointerException("Null is not a valid sketch."); Review comment: [Edit: see comment about @Nullable below first before making any changes to error messaging] How about (for the error message): "Expected a valid sketch or an empty byte array (for empty sketches), but found null"? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 309036) > A new count distinct transform based on BigQuery compatible HyperLogLog++ > implementation > > > Key: BEAM-7013 > URL: https://issues.apache.org/jira/browse/BEAM-7013 > Project: Beam > Issue Type: New Feature > Components: extensions-java-sketching, sdk-java-core >Reporter: Yueyang Qiu >Assignee: Yueyang Qiu >Priority: Major > Fix For: 2.16.0 > > Time Spent: 30h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=309035=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-309035 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 09/Sep/19 16:22 Start Date: 09/Sep/19 16:22 Worklog Time Spent: 10m Work Description: zfraa commented on pull request #9519: [BEAM-7013] Use a 0-length byte array to represent empty sketch in HllCount URL: https://github.com/apache/beam/pull/9519#discussion_r322333589 ## File path: sdks/java/extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCountMergePartialFn.java ## @@ -94,9 +99,9 @@ private HllCountMergePartialFn() {} @Override public byte[] extractOutput(@Nullable HyperLogLogPlusPlus accumulator) { if (accumulator == null) { - throw new IllegalStateException( - "HllCountMergePartialFn.extractOutput() should not be called on a null accumulator."); + return new byte[0]; Review comment: We don't need a defaultValue() implementation in this class because the accumulator is different and can encode "we haven't seen any input yet..." -- correct? (Ideally, we'd have more symmetry between Init and MergePartial, but that's a detail). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 309035) > A new count distinct transform based on BigQuery compatible HyperLogLog++ > implementation > > > Key: BEAM-7013 > URL: https://issues.apache.org/jira/browse/BEAM-7013 > Project: Beam > Issue Type: New Feature > Components: extensions-java-sketching, sdk-java-core >Reporter: Yueyang Qiu >Assignee: Yueyang Qiu >Priority: Major > Fix For: 2.16.0 > > Time Spent: 30h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=309034=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-309034 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 09/Sep/19 16:22 Start Date: 09/Sep/19 16:22 Worklog Time Spent: 10m Work Description: zfraa commented on pull request #9519: [BEAM-7013] Use a 0-length byte array to represent empty sketch in HllCount URL: https://github.com/apache/beam/pull/9519#discussion_r322267553 ## File path: sdks/java/extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCountMergePartialFn.java ## @@ -54,10 +54,15 @@ private HllCountMergePartialFn() {} return null; } + @Nullable @Override public HyperLogLogPlusPlus addInput( @Nullable HyperLogLogPlusPlus accumulator, byte[] input) { -if (accumulator == null) { +if (input == null) { + throw new NullPointerException("Null is not a valid sketch."); +} else if (input.length == 0) { + return accumulator; +} else if (accumulator == null) { Review comment: Did you consider changing the "empty accumulator" representation from null to byte[0] as well? Pros/Cons? (It might be conceptually easier to have just one representation for empty sketches -- the same internally and externally; but I don't see any large benefits to making the change, so...) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 309034) > A new count distinct transform based on BigQuery compatible HyperLogLog++ > implementation > > > Key: BEAM-7013 > URL: https://issues.apache.org/jira/browse/BEAM-7013 > Project: Beam > Issue Type: New Feature > Components: extensions-java-sketching, sdk-java-core >Reporter: Yueyang Qiu >Assignee: Yueyang Qiu >Priority: Major > Fix For: 2.16.0 > > Time Spent: 30h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=309032=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-309032 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 09/Sep/19 16:22 Start Date: 09/Sep/19 16:22 Worklog Time Spent: 10m Work Description: zfraa commented on pull request #9519: [BEAM-7013] Use a 0-length byte array to represent empty sketch in HllCount URL: https://github.com/apache/beam/pull/9519#discussion_r322268995 ## File path: sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/HllCountTest.java ## @@ -86,13 +88,6 @@ LONGS_SKETCH = hll.serializeToByteArray(); } - private static final byte[] LONGS_EMPTY_SKETCH; - - static { -HyperLogLogPlusPlus hll = new HyperLogLogPlusPlus.Builder().buildForLongs(); -LONGS_EMPTY_SKETCH = hll.serializeToByteArray(); - } - Review comment: We still need to handle this case as well! I.e., some empty sketches will be represented as byte[0], but not all. => Would keep a few tests with this, maybe give it a more specific name (LONGS_PROTO_OF_EMPTY_SKETCH, and the one above ZERO_BYTES_EMPTY_SKETCH..?) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 309032) > A new count distinct transform based on BigQuery compatible HyperLogLog++ > implementation > > > Key: BEAM-7013 > URL: https://issues.apache.org/jira/browse/BEAM-7013 > Project: Beam > Issue Type: New Feature > Components: extensions-java-sketching, sdk-java-core >Reporter: Yueyang Qiu >Assignee: Yueyang Qiu >Priority: Major > Fix For: 2.16.0 > > Time Spent: 30h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=309030=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-309030 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 09/Sep/19 16:22 Start Date: 09/Sep/19 16:22 Worklog Time Spent: 10m Work Description: zfraa commented on pull request #9519: [BEAM-7013] Use a 0-length byte array to represent empty sketch in HllCount URL: https://github.com/apache/beam/pull/9519#discussion_r322250509 ## File path: sdks/java/extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCount.java ## @@ -346,7 +348,13 @@ private Extract() {} @ProcessElement public void processElement( @Element byte[] sketch, OutputReceiver receiver) { - receiver.output(HyperLogLogPlusPlus.forProto(sketch).result()); +if (sketch == null) { + throw new NullPointerException("Null is not a valid sketch."); +} else if (sketch.length == 0) { + receiver.output(0L); Review comment: Yay, nice! :D This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 309030) Time Spent: 30h (was: 29h 50m) > A new count distinct transform based on BigQuery compatible HyperLogLog++ > implementation > > > Key: BEAM-7013 > URL: https://issues.apache.org/jira/browse/BEAM-7013 > Project: Beam > Issue Type: New Feature > Components: extensions-java-sketching, sdk-java-core >Reporter: Yueyang Qiu >Assignee: Yueyang Qiu >Priority: Major > Fix For: 2.16.0 > > Time Spent: 30h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.2#803003)
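The `Extract` behaviour praised in the comment above (a zero-length sketch yields a count of 0) can be illustrated with a small stand-alone model. This is a sketch under stated assumptions: `fakeResult` is a hypothetical replacement for `HyperLogLogPlusPlus.forProto(sketch).result()` that counts distinct 8-byte longs exactly, and the byte layout is invented for the example. Only the three branches mirror the diff.

```java
import java.nio.ByteBuffer;
import java.util.HashSet;
import java.util.Set;

final class ExtractSketch {
  /** Hypothetical stand-in for forProto(sketch).result(): exact distinct count of packed longs. */
  static long fakeResult(byte[] sketch) {
    Set<Long> distinct = new HashSet<>();
    ByteBuffer buffer = ByteBuffer.wrap(sketch);
    while (buffer.remaining() >= Long.BYTES) {
      distinct.add(buffer.getLong());
    }
    return distinct.size();
  }

  /** Mirrors the three branches in the Extract diff: null, zero-length, and regular sketch. */
  static long extract(byte[] sketch) {
    if (sketch == null) {
      throw new NullPointerException("Null is not a valid sketch.");
    } else if (sketch.length == 0) {
      return 0L; // empty sketch: no elements were ever aggregated
    }
    return fakeResult(sketch);
  }
}
```

Returning 0 for the zero-length case is what lets an empty aggregation flow through the pipeline as a value rather than as a missing element.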
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=309031=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-309031 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 09/Sep/19 16:22 Start Date: 09/Sep/19 16:22 Worklog Time Spent: 10m Work Description: zfraa commented on pull request #9519: [BEAM-7013] Use a 0-length byte array to represent empty sketch in HllCount URL: https://github.com/apache/beam/pull/9519#discussion_r322264216 ## File path: sdks/java/extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCountMergePartialFn.java ## @@ -54,10 +54,15 @@ private HllCountMergePartialFn() {} return null; } + @Nullable @Override public HyperLogLogPlusPlus addInput( @Nullable HyperLogLogPlusPlus accumulator, byte[] input) { Review comment: tl;dr: why not handle nulls instead of throwing? I didn't find any sources on the exact implied semantics of @Nullable, but I would tend to assume that if a parameter is annotated with @Nullable, the method handles it benignly when it is actually null, rather than throwing an exception. I would do one of the following two things: - Either remove the @Nullable annotation and keep throwing below (again, not feeling strongly about this); - Or -- I think we can safely assume that if a null is passed, it's supposed to be an empty sketch: maybe a BQ sketch that made it through importing without conversion, ... We have the means to support this smoothly by just treating nulls like byte[0] -- why not do this and save the users some exceptions? (This would need to be done consistently across methods.) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 309031) Time Spent: 30h (was: 29h 50m) > A new count distinct transform based on BigQuery compatible HyperLogLog++ > implementation > > > Key: BEAM-7013 > URL: https://issues.apache.org/jira/browse/BEAM-7013 > Project: Beam > Issue Type: New Feature > Components: extensions-java-sketching, sdk-java-core >Reporter: Yueyang Qiu >Assignee: Yueyang Qiu >Priority: Major > Fix For: 2.16.0 > > Time Spent: 30h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=309029=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-309029 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 09/Sep/19 16:22 Start Date: 09/Sep/19 16:22 Worklog Time Spent: 10m Work Description: zfraa commented on pull request #9519: [BEAM-7013] Use a 0-length byte array to represent empty sketch in HllCount URL: https://github.com/apache/beam/pull/9519#discussion_r322257693 ## File path: sdks/java/extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCountMergePartialFn.java ## @@ -54,10 +54,15 @@ private HllCountMergePartialFn() {} return null; } + @Nullable Review comment: Why nullable if we never return null? Might be from a previous version of this PR? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 309029) Time Spent: 29h 50m (was: 29h 40m) > A new count distinct transform based on BigQuery compatible HyperLogLog++ > implementation > > > Key: BEAM-7013 > URL: https://issues.apache.org/jira/browse/BEAM-7013 > Project: Beam > Issue Type: New Feature > Components: extensions-java-sketching, sdk-java-core >Reporter: Yueyang Qiu >Assignee: Yueyang Qiu >Priority: Major > Fix For: 2.16.0 > > Time Spent: 29h 50m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=309033=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-309033 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 09/Sep/19 16:22 Start Date: 09/Sep/19 16:22 Worklog Time Spent: 10m Work Description: zfraa commented on pull request #9519: [BEAM-7013] Use a 0-length byte array to represent empty sketch in HllCount URL: https://github.com/apache/beam/pull/9519#discussion_r322257094 ## File path: sdks/java/extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCountInitFn.java ## @@ -57,6 +57,11 @@ void setPrecision(int precision) { this.precision = precision; } + @Override + public byte[] defaultValue() { +return new byte[0]; Review comment: Maybe add an implementation comment, something like: "An empty aggregation is represented by an empty byte[], since we cannot create sketches without knowing their type. byte[] is space-efficient, but safer than null. As opposed to returning an empty PCollection, it allows us to return '0' when extracting from the sketch." This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 309033) > A new count distinct transform based on BigQuery compatible HyperLogLog++ > implementation > > > Key: BEAM-7013 > URL: https://issues.apache.org/jira/browse/BEAM-7013 > Project: Beam > Issue Type: New Feature > Components: extensions-java-sketching, sdk-java-core >Reporter: Yueyang Qiu >Assignee: Yueyang Qiu >Priority: Major > Fix For: 2.16.0 > > Time Spent: 30h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=308657=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-308657 ]

ASF GitHub Bot logged work on BEAM-7013:

Author: ASF GitHub Bot
Created on: 09/Sep/19 07:17
Start Date: 09/Sep/19 07:17
Worklog Time Spent: 10m
Work Description: robinyqiu commented on pull request #9519: [BEAM-7013] Use a 0-length byte array to represent empty sketch in HllCount
URL: https://github.com/apache/beam/pull/9519

r: @zfraa

Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

- [ ] [**Choose reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and mention them in a comment (`R: @username`).
- [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
- [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=304309=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-304309 ]

ASF GitHub Bot logged work on BEAM-7013:

Author: ASF GitHub Bot
Created on: 30/Aug/19 15:20
Start Date: 30/Aug/19 15:20
Worklog Time Spent: 10m
Work Description: zfraa commented on pull request #9447: [WIP] [BEAM-7013] Handle null input in HllCount.MergePartial and HllCount.Extract
URL: https://github.com/apache/beam/pull/9447#discussion_r319555345

## File path: sdks/java/extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCount.java

@@ -346,7 +346,9 @@ private Extract() {}
   @ProcessElement
   public void processElement(
       @Element byte[] sketch, OutputReceiver receiver) {
-    receiver.output(HyperLogLogPlusPlus.forProto(sketch).result());
+    Long result =
+        (sketch == null) ? 0L : HyperLogLogPlusPlus.forProto(sketch).result();

Review comment:
Would it make sense to document the nullability of the sketch param somewhere -- method doc or @nullable annotation on param, ...

Issue Time Tracking
---
Worklog Id: (was: 304309)
Time Spent: 29h 10m (was: 29h)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=304311=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-304311 ]

ASF GitHub Bot logged work on BEAM-7013:

Author: ASF GitHub Bot
Created on: 30/Aug/19 15:20
Start Date: 30/Aug/19 15:20
Worklog Time Spent: 10m
Work Description: zfraa commented on pull request #9447: [WIP] [BEAM-7013] Handle null input in HllCount.MergePartial and HllCount.Extract
URL: https://github.com/apache/beam/pull/9447#discussion_r319557035

## File path: sdks/java/extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCountMergePartialFn.java

@@ -57,6 +57,7 @@ private HllCountMergePartialFn() {}
   @Override
   public HyperLogLogPlusPlus addInput(
       @Nullable HyperLogLogPlusPlus accumulator, byte[] input) {
+    if (input == null) return accumulator;

Review comment:
Also mark output as nullable now. Can we deal with a null output of addInput(..) throughout?

Issue Time Tracking
---
Worklog Id: (was: 304311)
Time Spent: 29.5h (was: 29h 20m)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=304310=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-304310 ]

ASF GitHub Bot logged work on BEAM-7013:

Author: ASF GitHub Bot
Created on: 30/Aug/19 15:20
Start Date: 30/Aug/19 15:20
Worklog Time Spent: 10m
Work Description: zfraa commented on pull request #9447: [WIP] [BEAM-7013] Handle null input in HllCount.MergePartial and HllCount.Extract
URL: https://github.com/apache/beam/pull/9447#discussion_r319559575

## File path: sdks/java/extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCountMergePartialFn.java

@@ -94,9 +95,9 @@ private HllCountMergePartialFn() {}
   @Override
   public byte[] extractOutput(@Nullable HyperLogLogPlusPlus accumulator) {
     if (accumulator == null) {
-      throw new IllegalStateException(
-          "HllCountMergePartialFn.extractOutput() should not be called on a null accumulator.");
+      return null;

Review comment:
same here (annotation + is this handled well everywhere downstream?)

Issue Time Tracking
---
Worklog Id: (was: 304310)
Time Spent: 29h 20m (was: 29h 10m)
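The reviewer's question in the three comments on PR #9447 is whether a null can flow safely from addInput through extractOutput to the final extract step. The following is a minimal sketch of that flow, not the Beam CombineFn API: the class NullTolerantMergeFn and all of its method shapes are hypothetical, and a Set<String> stands in for both the serialized sketch and the HLL++ accumulator.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of a merge function whose accumulator and output may both be null. */
final class NullTolerantMergeFn {
  /** No input seen yet, so there is no accumulator (assumption mirroring the diff's @Nullable). */
  static Set<String> createAccumulator() {
    return null;
  }

  /** A null input is skipped, so the result may still be null afterwards. */
  static Set<String> addInput(Set<String> accumulator, Set<String> input) {
    if (input == null) {
      return accumulator;
    }
    if (accumulator == null) {
      return new HashSet<>(input);
    }
    accumulator.addAll(input);
    return accumulator;
  }

  /** Returns null instead of throwing when nothing was merged. */
  static Set<String> extractOutput(Set<String> accumulator) {
    return accumulator;
  }

  /** The downstream step must therefore map null to a count of 0. */
  static long extractCount(Set<String> sketch) {
    return sketch == null ? 0L : sketch.size();
  }

  /** Runs the whole chain over a list of possibly-null inputs. */
  static long countDistinct(List<Set<String>> inputs) {
    Set<String> accumulator = createAccumulator();
    for (Set<String> input : inputs) {
      accumulator = addInput(accumulator, input);
    }
    return extractCount(extractOutput(accumulator));
  }

  public static void main(String[] args) {
    List<Set<String>> inputs = new ArrayList<>();
    inputs.add(null);
    Set<String> fruits = new HashSet<>();
    fruits.add("Apple");
    fruits.add("Orange");
    inputs.add(fruits);
    System.out.println(countDistinct(inputs));
  }
}
```

Every method that can see a null has to handle it explicitly, which is the burden the later PR #9519 removes by switching to a 0-length byte array.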
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=304305=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-304305 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 30/Aug/19 14:54 Start Date: 30/Aug/19 14:54 Worklog Time Spent: 10m Work Description: zfraa commented on pull request #9144: [BEAM-7013] Integrating ZetaSketch's HLL++ algorithm with Beam URL: https://github.com/apache/beam/pull/9144#discussion_r319549589 ## File path: sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/HllCountTest.java ## @@ -0,0 +1,373 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.beam.sdk.extensions.zetasketch; + +import com.google.zetasketch.HyperLogLogPlusPlus; +import com.google.zetasketch.shaded.com.google.protobuf.ByteString; +import java.util.ArrayList; +import java.util.Arrays; +import java.util.Collections; +import java.util.List; +import org.apache.beam.sdk.Pipeline.PipelineExecutionException; +import org.apache.beam.sdk.testing.NeedsRunner; +import org.apache.beam.sdk.testing.PAssert; +import org.apache.beam.sdk.testing.TestPipeline; +import org.apache.beam.sdk.transforms.Create; +import org.apache.beam.sdk.values.KV; +import org.apache.beam.sdk.values.PCollection; +import org.apache.beam.sdk.values.TypeDescriptor; +import org.junit.Rule; +import org.junit.Test; +import org.junit.experimental.categories.Category; +import org.junit.rules.ExpectedException; +import org.junit.runner.RunWith; +import org.junit.runners.JUnit4; + +/** Tests for {@link HllCount}. */ +@RunWith(JUnit4.class) +public class HllCountTest { + + @Rule public final transient TestPipeline p = TestPipeline.create(); + @Rule public transient ExpectedException thrown = ExpectedException.none(); + + // Integer + private static final List INTS1 = Arrays.asList(1, 2, 3, 3, 1, 4); + private static final byte[] INTS1_SKETCH; + private static final Long INTS1_ESTIMATE; + + static { +HyperLogLogPlusPlus hll = new HyperLogLogPlusPlus.Builder().buildForIntegers(); +INTS1.forEach(hll::add); +INTS1_SKETCH = hll.serializeToByteArray(); +INTS1_ESTIMATE = hll.longResult(); + } + + private static final List INTS2 = Arrays.asList(3, 3, 3, 3); + private static final byte[] INTS2_SKETCH; + private static final Long INTS2_ESTIMATE; + + static { +HyperLogLogPlusPlus hll = new HyperLogLogPlusPlus.Builder().buildForIntegers(); +INTS2.forEach(hll::add); +INTS2_SKETCH = hll.serializeToByteArray(); +INTS2_ESTIMATE = hll.longResult(); + } + + private static final byte[] INTS1_INTS2_SKETCH; + + static { +HyperLogLogPlusPlus hll = 
HyperLogLogPlusPlus.forProto(INTS1_SKETCH); +hll.merge(INTS2_SKETCH); +INTS1_INTS2_SKETCH = hll.serializeToByteArray(); + } + + // Long + private static final List LONGS = Collections.singletonList(1L); + private static final byte[] LONGS_SKETCH; + + static { +HyperLogLogPlusPlus hll = new HyperLogLogPlusPlus.Builder().buildForLongs(); +LONGS.forEach(hll::add); +LONGS_SKETCH = hll.serializeToByteArray(); + } + + private static final byte[] LONGS_EMPTY_SKETCH; + + static { +HyperLogLogPlusPlus hll = new HyperLogLogPlusPlus.Builder().buildForLongs(); +LONGS_EMPTY_SKETCH = hll.serializeToByteArray(); + } + + // String + private static final List STRINGS = Arrays.asList("s1", "s2", "s1", "s2"); + private static final byte[] STRINGS_SKETCH; + + static { +HyperLogLogPlusPlus hll = new HyperLogLogPlusPlus.Builder().buildForStrings(); +STRINGS.forEach(hll::add); +STRINGS_SKETCH = hll.serializeToByteArray(); + } + + private static final int TEST_PRECISION = 20; + private static final byte[] STRINGS_SKETCH_TEST_PRECISION; + + static { +HyperLogLogPlusPlus hll = +new HyperLogLogPlusPlus.Builder().normalPrecision(TEST_PRECISION).buildForStrings(); +STRINGS.forEach(hll::add); +STRINGS_SKETCH_TEST_PRECISION = hll.serializeToByteArray(); + } + + // Bytes + private static final byte[] BYTES0 = {(byte) 0x1, (byte)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=302178=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-302178 ]

ASF GitHub Bot logged work on BEAM-7013:

Author: ASF GitHub Bot
Created on: 27/Aug/19 16:35
Start Date: 27/Aug/19 16:35
Worklog Time Spent: 10m
Work Description: boyuanzz commented on pull request #9144: [BEAM-7013] Integrating ZetaSketch's HLL++ algorithm with Beam
URL: https://github.com/apache/beam/pull/9144

Issue Time Tracking
---
Worklog Id: (was: 302178)
Time Spent: 28h 50m (was: 28h 40m)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=301562=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-301562 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 26/Aug/19 22:37 Start Date: 26/Aug/19 22:37 Worklog Time Spent: 10m Work Description: robinyqiu commented on pull request #9144: [BEAM-7013] Integrating ZetaSketch's HLL++ algorithm with Beam URL: https://github.com/apache/beam/pull/9144#discussion_r317826798 ## File path: sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/BigQueryHllSketchCompatibilityIT.java ## @@ -49,26 +57,74 @@ @RunWith(JUnit4.class) public class BigQueryHllSketchCompatibilityIT { - private static final String DATASET_NAME = "zetasketch_compatibility_test"; + private static final String APP_NAME; + private static final String PROJECT_ID; + private static final String DATASET_ID; - // Table for testReadSketchFromBigQuery() + private static final List TEST_DATA = + Arrays.asList("Apple", "Orange", "Banana", "Orange"); + + // Data Table: used by testReadSketchFromBigQuery()) // Schema: only one STRING field named "data". // Content: prepopulated with 4 rows: "Apple", "Orange", "Banana", "Orange" - private static final String DATA_TABLE_NAME = "hll_data"; + private static final String DATA_TABLE_ID = "hll_data"; private static final String DATA_FIELD_NAME = "data"; + private static final String DATA_FIELD_TYPE = "STRING"; private static final String QUERY_RESULT_FIELD_NAME = "sketch"; private static final Long EXPECTED_COUNT = 3L; - // Table for testWriteSketchToBigQuery() + // Sketch Table: used by testWriteSketchToBigQuery() // Schema: only one BYTES field named "sketch". 
// Content: will be overridden by the sketch computed by the test pipeline each time the test runs - private static final String SKETCH_TABLE_NAME = "hll_sketch"; + private static final String SKETCH_TABLE_ID = "hll_sketch"; private static final String SKETCH_FIELD_NAME = "sketch"; - private static final List TEST_DATA = - Arrays.asList("Apple", "Orange", "Banana", "Orange"); + private static final String SKETCH_FIELD_TYPE = "BYTES"; // SHA-1 hash of string "[3]", the string representation of a row that has only one field 3 in it private static final String EXPECTED_CHECKSUM = "f1e31df9806ce94c5bdbbfff9608324930f4d3f1"; + static { +ApplicationNameOptions options = +TestPipeline.testingPipelineOptions().as(ApplicationNameOptions.class); +APP_NAME = options.getAppName(); +PROJECT_ID = options.as(GcpOptions.class).getProject(); +DATASET_ID = String.format("zetasketch_%tY_% A new count distinct transform based on BigQuery compatible HyperLogLog++ > implementation > > > Key: BEAM-7013 > URL: https://issues.apache.org/jira/browse/BEAM-7013 > Project: Beam > Issue Type: New Feature > Components: extensions-java-sketching, sdk-java-core >Reporter: Yueyang Qiu >Assignee: Yueyang Qiu >Priority: Major > Fix For: 2.16.0 > > Time Spent: 28h 40m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.2#803003)
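The two expected values in the quoted test constants follow from the test data: four rows with three distinct strings give a count of 3, and the checksum is described in the diff as the SHA-1 of the string "[3]". A small, self-contained sketch of that derivation is below. The class ExpectedValuesDemo and its helpers are hypothetical; it uses an exact HashSet rather than HLL++, and it does not reproduce how the integration test itself computes its checksum.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

/** Recomputes the expected count and a SHA-1 hex digest from the quoted test data. */
final class ExpectedValuesDemo {
  static final List<String> TEST_DATA = Arrays.asList("Apple", "Orange", "Banana", "Orange");

  /** Exact distinct count; the integration test expects the HLL++ estimate to match it. */
  static long exactDistinctCount(List<String> data) {
    return new HashSet<>(data).size();
  }

  /** Lower-case hex SHA-1 of the UTF-8 bytes of the given text. */
  static String sha1Hex(String text) {
    try {
      MessageDigest digest = MessageDigest.getInstance("SHA-1");
      StringBuilder hex = new StringBuilder();
      for (byte b : digest.digest(text.getBytes(StandardCharsets.UTF_8))) {
        hex.append(String.format("%02x", b));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      // SHA-1 is one of the algorithms every Java platform must provide.
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    long count = exactDistinctCount(TEST_DATA);
    System.out.println(count);
    // String form of a row with the single field `count`, as described in the diff comment.
    System.out.println(sha1Hex("[" + count + "]"));
  }
}
```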
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=301528=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-301528 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 26/Aug/19 21:35 Start Date: 26/Aug/19 21:35 Worklog Time Spent: 10m Work Description: boyuanzz commented on pull request #9144: [BEAM-7013] Integrating ZetaSketch's HLL++ algorithm with Beam URL: https://github.com/apache/beam/pull/9144#discussion_r317809353 ## File path: sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/BigQueryHllSketchCompatibilityIT.java ## @@ -49,26 +57,74 @@ @RunWith(JUnit4.class) public class BigQueryHllSketchCompatibilityIT { - private static final String DATASET_NAME = "zetasketch_compatibility_test"; + private static final String APP_NAME; + private static final String PROJECT_ID; + private static final String DATASET_ID; - // Table for testReadSketchFromBigQuery() + private static final List TEST_DATA = + Arrays.asList("Apple", "Orange", "Banana", "Orange"); + + // Data Table: used by testReadSketchFromBigQuery()) // Schema: only one STRING field named "data". // Content: prepopulated with 4 rows: "Apple", "Orange", "Banana", "Orange" - private static final String DATA_TABLE_NAME = "hll_data"; + private static final String DATA_TABLE_ID = "hll_data"; private static final String DATA_FIELD_NAME = "data"; + private static final String DATA_FIELD_TYPE = "STRING"; private static final String QUERY_RESULT_FIELD_NAME = "sketch"; private static final Long EXPECTED_COUNT = 3L; - // Table for testWriteSketchToBigQuery() + // Sketch Table: used by testWriteSketchToBigQuery() // Schema: only one BYTES field named "sketch". 
// Content: will be overridden by the sketch computed by the test pipeline each time the test runs - private static final String SKETCH_TABLE_NAME = "hll_sketch"; + private static final String SKETCH_TABLE_ID = "hll_sketch"; private static final String SKETCH_FIELD_NAME = "sketch"; - private static final List TEST_DATA = - Arrays.asList("Apple", "Orange", "Banana", "Orange"); + private static final String SKETCH_FIELD_TYPE = "BYTES"; // SHA-1 hash of string "[3]", the string representation of a row that has only one field 3 in it private static final String EXPECTED_CHECKSUM = "f1e31df9806ce94c5bdbbfff9608324930f4d3f1"; + static { +ApplicationNameOptions options = +TestPipeline.testingPipelineOptions().as(ApplicationNameOptions.class); +APP_NAME = options.getAppName(); +PROJECT_ID = options.as(GcpOptions.class).getProject(); +DATASET_ID = String.format("zetasketch_%tY_% bq can't be final since this method is not a constructor. minor: I mean you can make `bq` as a static attr of your test class. > It is fine though because BigqueryClient.getClient() does caching for us so it won't create another new client the second time we call it. minor: I don't think there is any caching logic in `BigqueryClient`. ` BigqueryClient.getClient()` just returns `new BigqueryClient()` simply This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 301528) Time Spent: 28.5h (was: 28h 20m) > A new count distinct transform based on BigQuery compatible HyperLogLog++ > implementation > > > Key: BEAM-7013 > URL: https://issues.apache.org/jira/browse/BEAM-7013 > Project: Beam > Issue Type: New Feature > Components: extensions-java-sketching, sdk-java-core >Reporter: Yueyang Qiu >Assignee: Yueyang Qiu >Priority: Major > Fix For: 2.16.0 > > Time Spent: 28.5h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.2#803003)
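The reviewer's two points above (make `bq` a static attribute; the factory call does no caching) can be shown with a stand-in class. Everything here is hypothetical: FakeClient only mimics a factory method that returns a fresh object on every call, as the comment says `BigqueryClient.getClient()` does, and none of this is the Beam test utility itself.

```java
/** Contrasts calling a non-caching factory repeatedly with holding one static instance. */
final class SharedClientDemo {
  /** Hypothetical client whose factory creates a new object on every call. */
  static final class FakeClient {
    static FakeClient getClient() {
      return new FakeClient();
    }
  }

  /** Created once when the class is initialized and reused by every caller. */
  private static final FakeClient SHARED = FakeClient.getClient();

  static FakeClient shared() {
    return SHARED;
  }

  public static void main(String[] args) {
    // Two factory calls yield two distinct objects, since nothing is cached.
    System.out.println(FakeClient.getClient() == FakeClient.getClient());
    // The static field always yields the same object.
    System.out.println(shared() == shared());
  }
}
```

A static final field also sidesteps the "can't be final since this method is not a constructor" concern quoted above, because static initialization is allowed to assign it.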
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=301527=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-301527 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 26/Aug/19 21:30 Start Date: 26/Aug/19 21:30 Worklog Time Spent: 10m Work Description: boyuanzz commented on pull request #9144: [BEAM-7013] Integrating ZetaSketch's HLL++ algorithm with Beam URL: https://github.com/apache/beam/pull/9144#discussion_r317807506 ## File path: sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/BigQueryHllSketchCompatibilityIT.java ## @@ -49,26 +57,74 @@ @RunWith(JUnit4.class) public class BigQueryHllSketchCompatibilityIT { - private static final String DATASET_NAME = "zetasketch_compatibility_test"; + private static final String APP_NAME; + private static final String PROJECT_ID; + private static final String DATASET_ID; - // Table for testReadSketchFromBigQuery() + private static final List TEST_DATA = + Arrays.asList("Apple", "Orange", "Banana", "Orange"); + + // Data Table: used by testReadSketchFromBigQuery()) // Schema: only one STRING field named "data". // Content: prepopulated with 4 rows: "Apple", "Orange", "Banana", "Orange" - private static final String DATA_TABLE_NAME = "hll_data"; + private static final String DATA_TABLE_ID = "hll_data"; private static final String DATA_FIELD_NAME = "data"; + private static final String DATA_FIELD_TYPE = "STRING"; private static final String QUERY_RESULT_FIELD_NAME = "sketch"; private static final Long EXPECTED_COUNT = 3L; - // Table for testWriteSketchToBigQuery() + // Sketch Table: used by testWriteSketchToBigQuery() // Schema: only one BYTES field named "sketch". 
// Content: will be overridden by the sketch computed by the test pipeline each time the test runs - private static final String SKETCH_TABLE_NAME = "hll_sketch"; + private static final String SKETCH_TABLE_ID = "hll_sketch"; private static final String SKETCH_FIELD_NAME = "sketch"; - private static final List TEST_DATA = - Arrays.asList("Apple", "Orange", "Banana", "Orange"); + private static final String SKETCH_FIELD_TYPE = "BYTES"; // SHA-1 hash of string "[3]", the string representation of a row that has only one field 3 in it private static final String EXPECTED_CHECKSUM = "f1e31df9806ce94c5bdbbfff9608324930f4d3f1"; + static { +ApplicationNameOptions options = +TestPipeline.testingPipelineOptions().as(ApplicationNameOptions.class); +APP_NAME = options.getAppName(); +PROJECT_ID = options.as(GcpOptions.class).getProject(); +DATASET_ID = String.format("zetasketch_%tY_% A new count distinct transform based on BigQuery compatible HyperLogLog++ > implementation > > > Key: BEAM-7013 > URL: https://issues.apache.org/jira/browse/BEAM-7013 > Project: Beam > Issue Type: New Feature > Components: extensions-java-sketching, sdk-java-core >Reporter: Yueyang Qiu >Assignee: Yueyang Qiu >Priority: Major > Fix For: 2.16.0 > > Time Spent: 28h 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=301524=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-301524 ]

ASF GitHub Bot logged work on BEAM-7013:

Author: ASF GitHub Bot
Created on: 26/Aug/19 21:25
Start Date: 26/Aug/19 21:25
Worklog Time Spent: 10m
Work Description: robinyqiu commented on issue #9144: [BEAM-7013] Integrating ZetaSketch's HLL++ algorithm with Beam
URL: https://github.com/apache/beam/pull/9144#issuecomment-525039196

Thank you all @boyuanzz @zfraa @amaliujia @reuvenlax for the review! I have squashed all the commits into one and I think this PR is ready to be merged once the precommit tests pass.

Issue Time Tracking
---
Worklog Id: (was: 301524)
Time Spent: 28h 10m (was: 28h)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=301523=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-301523 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 26/Aug/19 21:22 Start Date: 26/Aug/19 21:22 Worklog Time Spent: 10m Work Description: robinyqiu commented on pull request #9144: [BEAM-7013] Integrating ZetaSketch's HLL++ algorithm with Beam URL: https://github.com/apache/beam/pull/9144#discussion_r317804729 ## File path: sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/BigQueryHllSketchCompatibilityIT.java ## @@ -49,26 +57,74 @@ @RunWith(JUnit4.class) public class BigQueryHllSketchCompatibilityIT { - private static final String DATASET_NAME = "zetasketch_compatibility_test"; + private static final String APP_NAME; + private static final String PROJECT_ID; + private static final String DATASET_ID; - // Table for testReadSketchFromBigQuery() + private static final List TEST_DATA = + Arrays.asList("Apple", "Orange", "Banana", "Orange"); + + // Data Table: used by testReadSketchFromBigQuery()) // Schema: only one STRING field named "data". // Content: prepopulated with 4 rows: "Apple", "Orange", "Banana", "Orange" - private static final String DATA_TABLE_NAME = "hll_data"; + private static final String DATA_TABLE_ID = "hll_data"; private static final String DATA_FIELD_NAME = "data"; + private static final String DATA_FIELD_TYPE = "STRING"; private static final String QUERY_RESULT_FIELD_NAME = "sketch"; private static final Long EXPECTED_COUNT = 3L; - // Table for testWriteSketchToBigQuery() + // Sketch Table: used by testWriteSketchToBigQuery() // Schema: only one BYTES field named "sketch". 
// Content: will be overridden by the sketch computed by the test pipeline each time the test runs - private static final String SKETCH_TABLE_NAME = "hll_sketch"; + private static final String SKETCH_TABLE_ID = "hll_sketch"; private static final String SKETCH_FIELD_NAME = "sketch"; - private static final List TEST_DATA = - Arrays.asList("Apple", "Orange", "Banana", "Orange"); + private static final String SKETCH_FIELD_TYPE = "BYTES"; // SHA-1 hash of string "[3]", the string representation of a row that has only one field 3 in it private static final String EXPECTED_CHECKSUM = "f1e31df9806ce94c5bdbbfff9608324930f4d3f1"; + static { +ApplicationNameOptions options = +TestPipeline.testingPipelineOptions().as(ApplicationNameOptions.class); +APP_NAME = options.getAppName(); +PROJECT_ID = options.as(GcpOptions.class).getProject(); +DATASET_ID = String.format("zetasketch_%tY_% A new count distinct transform based on BigQuery compatible HyperLogLog++ > implementation > > > Key: BEAM-7013 > URL: https://issues.apache.org/jira/browse/BEAM-7013 > Project: Beam > Issue Type: New Feature > Components: extensions-java-sketching, sdk-java-core >Reporter: Yueyang Qiu >Assignee: Yueyang Qiu >Priority: Major > Fix For: 2.16.0 > > Time Spent: 28h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=301522=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-301522 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 26/Aug/19 21:22 Start Date: 26/Aug/19 21:22 Worklog Time Spent: 10m Work Description: robinyqiu commented on pull request #9144: [BEAM-7013] Integrating ZetaSketch's HLL++ algorithm with Beam URL: https://github.com/apache/beam/pull/9144#discussion_r317804549 ## File path: sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/BigQueryHllSketchCompatibilityIT.java ## @@ -49,26 +57,74 @@ @RunWith(JUnit4.class) public class BigQueryHllSketchCompatibilityIT { - private static final String DATASET_NAME = "zetasketch_compatibility_test"; + private static final String APP_NAME; + private static final String PROJECT_ID; + private static final String DATASET_ID; - // Table for testReadSketchFromBigQuery() + private static final List TEST_DATA = + Arrays.asList("Apple", "Orange", "Banana", "Orange"); + + // Data Table: used by testReadSketchFromBigQuery()) // Schema: only one STRING field named "data". // Content: prepopulated with 4 rows: "Apple", "Orange", "Banana", "Orange" - private static final String DATA_TABLE_NAME = "hll_data"; + private static final String DATA_TABLE_ID = "hll_data"; private static final String DATA_FIELD_NAME = "data"; + private static final String DATA_FIELD_TYPE = "STRING"; private static final String QUERY_RESULT_FIELD_NAME = "sketch"; private static final Long EXPECTED_COUNT = 3L; - // Table for testWriteSketchToBigQuery() + // Sketch Table: used by testWriteSketchToBigQuery() // Schema: only one BYTES field named "sketch". 
// Content: will be overridden by the sketch computed by the test pipeline each time the test runs - private static final String SKETCH_TABLE_NAME = "hll_sketch"; + private static final String SKETCH_TABLE_ID = "hll_sketch"; private static final String SKETCH_FIELD_NAME = "sketch"; - private static final List TEST_DATA = - Arrays.asList("Apple", "Orange", "Banana", "Orange"); + private static final String SKETCH_FIELD_TYPE = "BYTES"; // SHA-1 hash of string "[3]", the string representation of a row that has only one field 3 in it private static final String EXPECTED_CHECKSUM = "f1e31df9806ce94c5bdbbfff9608324930f4d3f1"; + static { +ApplicationNameOptions options = +TestPipeline.testingPipelineOptions().as(ApplicationNameOptions.class); +APP_NAME = options.getAppName(); +PROJECT_ID = options.as(GcpOptions.class).getProject(); +DATASET_ID = String.format("zetasketch_%tY_% A new count distinct transform based on BigQuery compatible HyperLogLog++ > implementation > > > Key: BEAM-7013 > URL: https://issues.apache.org/jira/browse/BEAM-7013 > Project: Beam > Issue Type: New Feature > Components: extensions-java-sketching, sdk-java-core >Reporter: Yueyang Qiu >Assignee: Yueyang Qiu >Priority: Major > Fix For: 2.16.0 > > Time Spent: 27h 50m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.2#803003)
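For context on the constants in the diff above: EXPECTED_COUNT is 3 because TEST_DATA holds four strings, one of which is a duplicate. The following is a minimal sketch of the exact distinct count that the HLL++ estimate is expected to match on this tiny input. The class name ExactDistinctCount and the method countDistinct are hypothetical and not part of the PR; the sketch uses a plain HashSet, not ZetaSketch.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

public class ExactDistinctCount {

  // Exact (non-approximate) distinct count; a HashSet drops duplicate values.
  static long countDistinct(List<String> values) {
    return new HashSet<>(values).size();
  }

  public static void main(String[] args) {
    // Same values as TEST_DATA in the quoted diff.
    List<String> testData = Arrays.asList("Apple", "Orange", "Banana", "Orange");
    System.out.println(countDistinct(testData)); // prints 3
  }
}
```

On inputs this small an HLL++ sketch is expected to return the exact count, which is presumably why the integration test can compare against a fixed value rather than a tolerance range.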
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=301406=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-301406 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 26/Aug/19 18:24 Start Date: 26/Aug/19 18:24 Worklog Time Spent: 10m Work Description: boyuanzz commented on pull request #9144: [BEAM-7013] Integrating ZetaSketch's HLL++ algorithm with Beam URL: https://github.com/apache/beam/pull/9144#discussion_r317731169 ## File path: sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/BigQueryHllSketchCompatibilityIT.java ## @@ -49,26 +57,74 @@ @RunWith(JUnit4.class) public class BigQueryHllSketchCompatibilityIT { - private static final String DATASET_NAME = "zetasketch_compatibility_test"; + private static final String APP_NAME; + private static final String PROJECT_ID; + private static final String DATASET_ID; - // Table for testReadSketchFromBigQuery() + private static final List TEST_DATA = + Arrays.asList("Apple", "Orange", "Banana", "Orange"); + + // Data Table: used by testReadSketchFromBigQuery()) // Schema: only one STRING field named "data". // Content: prepopulated with 4 rows: "Apple", "Orange", "Banana", "Orange" - private static final String DATA_TABLE_NAME = "hll_data"; + private static final String DATA_TABLE_ID = "hll_data"; private static final String DATA_FIELD_NAME = "data"; + private static final String DATA_FIELD_TYPE = "STRING"; private static final String QUERY_RESULT_FIELD_NAME = "sketch"; private static final Long EXPECTED_COUNT = 3L; - // Table for testWriteSketchToBigQuery() + // Sketch Table: used by testWriteSketchToBigQuery() // Schema: only one BYTES field named "sketch". 
// Content: will be overridden by the sketch computed by the test pipeline each time the test runs - private static final String SKETCH_TABLE_NAME = "hll_sketch"; + private static final String SKETCH_TABLE_ID = "hll_sketch"; private static final String SKETCH_FIELD_NAME = "sketch"; - private static final List TEST_DATA = - Arrays.asList("Apple", "Orange", "Banana", "Orange"); + private static final String SKETCH_FIELD_TYPE = "BYTES"; // SHA-1 hash of string "[3]", the string representation of a row that has only one field 3 in it private static final String EXPECTED_CHECKSUM = "f1e31df9806ce94c5bdbbfff9608324930f4d3f1"; + static { +ApplicationNameOptions options = +TestPipeline.testingPipelineOptions().as(ApplicationNameOptions.class); +APP_NAME = options.getAppName(); +PROJECT_ID = options.as(GcpOptions.class).getProject(); +DATASET_ID = String.format("zetasketch_%tY_% A new count distinct transform based on BigQuery compatible HyperLogLog++ > implementation > > > Key: BEAM-7013 > URL: https://issues.apache.org/jira/browse/BEAM-7013 > Project: Beam > Issue Type: New Feature > Components: extensions-java-sketching, sdk-java-core >Reporter: Yueyang Qiu >Assignee: Yueyang Qiu >Priority: Major > Fix For: 2.16.0 > > Time Spent: 27h 40m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=301405=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-301405 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 26/Aug/19 18:24 Start Date: 26/Aug/19 18:24 Worklog Time Spent: 10m Work Description: boyuanzz commented on pull request #9144: [BEAM-7013] Integrating ZetaSketch's HLL++ algorithm with Beam URL: https://github.com/apache/beam/pull/9144#discussion_r317730080 ## File path: sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/BigQueryHllSketchCompatibilityIT.java ## @@ -49,26 +57,74 @@ @RunWith(JUnit4.class) public class BigQueryHllSketchCompatibilityIT { - private static final String DATASET_NAME = "zetasketch_compatibility_test"; + private static final String APP_NAME; + private static final String PROJECT_ID; + private static final String DATASET_ID; - // Table for testReadSketchFromBigQuery() + private static final List TEST_DATA = + Arrays.asList("Apple", "Orange", "Banana", "Orange"); + + // Data Table: used by testReadSketchFromBigQuery()) // Schema: only one STRING field named "data". // Content: prepopulated with 4 rows: "Apple", "Orange", "Banana", "Orange" - private static final String DATA_TABLE_NAME = "hll_data"; + private static final String DATA_TABLE_ID = "hll_data"; private static final String DATA_FIELD_NAME = "data"; + private static final String DATA_FIELD_TYPE = "STRING"; private static final String QUERY_RESULT_FIELD_NAME = "sketch"; private static final Long EXPECTED_COUNT = 3L; - // Table for testWriteSketchToBigQuery() + // Sketch Table: used by testWriteSketchToBigQuery() // Schema: only one BYTES field named "sketch". 
// Content: will be overridden by the sketch computed by the test pipeline each time the test runs - private static final String SKETCH_TABLE_NAME = "hll_sketch"; + private static final String SKETCH_TABLE_ID = "hll_sketch"; private static final String SKETCH_FIELD_NAME = "sketch"; - private static final List TEST_DATA = - Arrays.asList("Apple", "Orange", "Banana", "Orange"); + private static final String SKETCH_FIELD_TYPE = "BYTES"; // SHA-1 hash of string "[3]", the string representation of a row that has only one field 3 in it private static final String EXPECTED_CHECKSUM = "f1e31df9806ce94c5bdbbfff9608324930f4d3f1"; + static { +ApplicationNameOptions options = +TestPipeline.testingPipelineOptions().as(ApplicationNameOptions.class); +APP_NAME = options.getAppName(); +PROJECT_ID = options.as(GcpOptions.class).getProject(); +DATASET_ID = String.format("zetasketch_%tY_% A new count distinct transform based on BigQuery compatible HyperLogLog++ > implementation > > > Key: BEAM-7013 > URL: https://issues.apache.org/jira/browse/BEAM-7013 > Project: Beam > Issue Type: New Feature > Components: extensions-java-sketching, sdk-java-core >Reporter: Yueyang Qiu >Assignee: Yueyang Qiu >Priority: Major > Fix For: 2.16.0 > > Time Spent: 27.5h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=301403=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-301403 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 26/Aug/19 18:21 Start Date: 26/Aug/19 18:21 Worklog Time Spent: 10m Work Description: robinyqiu commented on issue #9144: [BEAM-7013] Integrating ZetaSketch's HLL++ algorithm with Beam URL: https://github.com/apache/beam/pull/9144#issuecomment-524971095 Run Java PostCommit Issue Time Tracking --- Worklog Id: (was: 301403) Time Spent: 27h 20m (was: 27h 10m)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=300067=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-300067 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 23/Aug/19 06:09 Start Date: 23/Aug/19 06:09 Worklog Time Spent: 10m Work Description: robinyqiu commented on issue #9144: [BEAM-7013] Integrating ZetaSketch's HLL++ algorithm with Beam URL: https://github.com/apache/beam/pull/9144#issuecomment-524185713 Run Java PostCommit Issue Time Tracking --- Worklog Id: (was: 300067) Time Spent: 27h 10m (was: 27h)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=299822=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-299822 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 22/Aug/19 23:33 Start Date: 22/Aug/19 23:33 Worklog Time Spent: 10m Work Description: robinyqiu commented on issue #9144: [BEAM-7013] Integrating ZetaSketch's HLL++ algorithm with Beam URL: https://github.com/apache/beam/pull/9144#issuecomment-524116749 Run Java PostCommit Issue Time Tracking --- Worklog Id: (was: 299822) Time Spent: 27h (was: 26h 50m)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=299820=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-299820 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 22/Aug/19 23:28 Start Date: 22/Aug/19 23:28 Worklog Time Spent: 10m Work Description: robinyqiu commented on issue #9144: [BEAM-7013] Integrating ZetaSketch's HLL++ algorithm with Beam URL: https://github.com/apache/beam/pull/9144#issuecomment-524115723 I have made the change such that the BQ tables needed for testing are now created before the tests and deleted after the tests. PTAL. Issue Time Tracking --- Worklog Id: (was: 299820) Time Spent: 26h 50m (was: 26h 40m)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=299819=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-299819 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 22/Aug/19 23:27 Start Date: 22/Aug/19 23:27 Worklog Time Spent: 10m Work Description: robinyqiu commented on pull request #9144: [BEAM-7013] Integrating ZetaSketch's HLL++ algorithm with Beam URL: https://github.com/apache/beam/pull/9144#discussion_r316924297 ## File path: sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/BigQueryHllSketchCompatibilityIT.java ## @@ -0,0 +1,145 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.beam.sdk.extensions.zetasketch; + +import com.google.api.services.bigquery.model.TableFieldSchema; +import com.google.api.services.bigquery.model.TableRow; +import com.google.api.services.bigquery.model.TableSchema; +import java.nio.ByteBuffer; +import java.util.Arrays; +import java.util.Collections; +import java.util.List; +import org.apache.beam.sdk.Pipeline; +import org.apache.beam.sdk.coders.ByteArrayCoder; +import org.apache.beam.sdk.extensions.gcp.options.GcpOptions; +import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO; +import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.TypedRead.Method; +import org.apache.beam.sdk.io.gcp.bigquery.SchemaAndRecord; +import org.apache.beam.sdk.io.gcp.testing.BigqueryMatcher; +import org.apache.beam.sdk.options.ApplicationNameOptions; +import org.apache.beam.sdk.testing.PAssert; +import org.apache.beam.sdk.testing.TestPipeline; +import org.apache.beam.sdk.testing.TestPipelineOptions; +import org.apache.beam.sdk.transforms.Create; +import org.apache.beam.sdk.transforms.SerializableFunction; +import org.apache.beam.sdk.values.PCollection; +import org.junit.Test; +import org.junit.runner.RunWith; +import org.junit.runners.JUnit4; + +/** + * Integration tests for HLL++ sketch compatibility between Beam and BigQuery. The tests verifies + * that HLL++ sketches created in Beam can be processed by BigQuery, and vice versa. + */ +@RunWith(JUnit4.class) +public class BigQueryHllSketchCompatibilityIT { + + private static final String DATASET_NAME = "zetasketch_compatibility_test"; + + // Table for testReadSketchFromBigQuery() + // Schema: only one STRING field named "data". 
+ // Content: prepopulated with 4 rows: "Apple", "Orange", "Banana", "Orange" + private static final String DATA_TABLE_NAME = "hll_data"; + private static final String DATA_FIELD_NAME = "data"; + private static final String QUERY_RESULT_FIELD_NAME = "sketch"; + private static final Long EXPECTED_COUNT = 3L; + + // Table for testWriteSketchToBigQuery() + // Schema: only one BYTES field named "sketch". + // Content: will be overridden by the sketch computed by the test pipeline each time the test runs + private static final String SKETCH_TABLE_NAME = "hll_sketch"; + private static final String SKETCH_FIELD_NAME = "sketch"; + private static final List TEST_DATA = + Arrays.asList("Apple", "Orange", "Banana", "Orange"); + // SHA-1 hash of string "[3]", the string representation of a row that has only one field 3 in it + private static final String EXPECTED_CHECKSUM = "f1e31df9806ce94c5bdbbfff9608324930f4d3f1"; + + /** + * Test that HLL++ sketch computed in BigQuery can be processed by Beam. Hll sketch is computed by + * {@code HLL_COUNT.INIT} in BigQuery and read into Beam; the test verifies that we can run {@link + * HllCount.MergePartial} and {@link HllCount.Extract} on the sketch in Beam to get the correct + * estimated count. + */ + @Test + public void testReadSketchFromBigQuery() { Review comment: Done. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking ---
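The EXPECTED_CHECKSUM comment in the diff above describes the value as the SHA-1 hash of the string "[3]". A small sketch of how such a hex digest can be computed with the JDK's MessageDigest follows. The class name RowChecksum and the method sha1Hex are hypothetical, and how BigqueryMatcher derives its checksum from query results is not shown in this thread, so this only illustrates the hashing step under that assumption.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class RowChecksum {

  // Returns the lowercase hex SHA-1 digest of the UTF-8 bytes of the input.
  static String sha1Hex(String input) {
    try {
      MessageDigest digest = MessageDigest.getInstance("SHA-1");
      byte[] hash = digest.digest(input.getBytes(StandardCharsets.UTF_8));
      StringBuilder hex = new StringBuilder(hash.length * 2);
      for (byte b : hash) {
        // Mask to 0..255 so negative bytes format as two hex digits.
        hex.append(String.format("%02x", b & 0xff));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      // SHA-1 is a required algorithm on standard JDKs.
      throw new IllegalStateException("SHA-1 not available", e);
    }
  }

  public static void main(String[] args) {
    // Constant copied from the quoted diff; the comparison below checks the
    // claim in its comment instead of assuming it.
    String expected = "f1e31df9806ce94c5bdbbfff9608324930f4d3f1";
    String actual = sha1Hex("[3]");
    System.out.println(actual);
    System.out.println("matches EXPECTED_CHECKSUM: " + actual.equals(expected));
  }
}
```

Hashing the string form of the result row lets the test compare a single fixed string instead of reading the BYTES sketch back and decoding it.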
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=299818=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-299818 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 22/Aug/19 23:27 Start Date: 22/Aug/19 23:27 Worklog Time Spent: 10m Work Description: robinyqiu commented on pull request #9144: [BEAM-7013] Integrating ZetaSketch's HLL++ algorithm with Beam URL: https://github.com/apache/beam/pull/9144#discussion_r316924234 ## File path: sdks/java/extensions/zetasketch/build.gradle ## @@ -0,0 +1,64 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * License); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +import groovy.json.JsonOutput + +plugins { id 'org.apache.beam.module' } +applyJavaNature() + +description = "Apache Beam :: SDKs :: Java :: Extensions :: ZetaSketch" + +def zetasketch_version = "0.1.0" + +dependencies { +compile library.java.vendored_guava_26_0_jre +compile project(path: ":sdks:java:core", configuration: "shadow") +compile "com.google.zetasketch:zetasketch:$zetasketch_version" +testCompile library.java.junit +testCompile project(":sdks:java:io:google-cloud-platform") +testRuntimeOnly project(":runners:direct-java") +testRuntimeOnly project(":runners:google-cloud-dataflow-java") +} + +/** + * Integration tests running on Dataflow with BigQuery. + */ +task integrationTest(type: Test) { +group = "Verification" +def gcpProject = project.findProperty('gcpProject') ?: 'apache-beam-testing' +def gcpTempRoot = project.findProperty('gcpTempRoot') ?: 'gs://temp-storage-for-end-to-end-tests' +systemProperty "beamTestPipelineOptions", JsonOutput.toJson([ +"--runner=TestDataflowRunner", +"--project=${gcpProject}", +"--tempRoot=${gcpTempRoot}", +]) Review comment: Done. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 299818) Time Spent: 26.5h (was: 26h 20m) > A new count distinct transform based on BigQuery compatible HyperLogLog++ > implementation > > > Key: BEAM-7013 > URL: https://issues.apache.org/jira/browse/BEAM-7013 > Project: Beam > Issue Type: New Feature > Components: extensions-java-sketching, sdk-java-core >Reporter: Yueyang Qiu >Assignee: Yueyang Qiu >Priority: Major > Fix For: 2.16.0 > > Time Spent: 26.5h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=299083=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-299083 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 21/Aug/19 23:57 Start Date: 21/Aug/19 23:57 Worklog Time Spent: 10m Work Description: boyuanzz commented on pull request #9144: [BEAM-7013] Integrating ZetaSketch's HLL++ algorithm with Beam URL: https://github.com/apache/beam/pull/9144#discussion_r316449066 ## File path: sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/BigQueryHllSketchCompatibilityIT.java ## @@ -0,0 +1,145 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.beam.sdk.extensions.zetasketch; + +import com.google.api.services.bigquery.model.TableFieldSchema; +import com.google.api.services.bigquery.model.TableRow; +import com.google.api.services.bigquery.model.TableSchema; +import java.nio.ByteBuffer; +import java.util.Arrays; +import java.util.Collections; +import java.util.List; +import org.apache.beam.sdk.Pipeline; +import org.apache.beam.sdk.coders.ByteArrayCoder; +import org.apache.beam.sdk.extensions.gcp.options.GcpOptions; +import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO; +import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.TypedRead.Method; +import org.apache.beam.sdk.io.gcp.bigquery.SchemaAndRecord; +import org.apache.beam.sdk.io.gcp.testing.BigqueryMatcher; +import org.apache.beam.sdk.options.ApplicationNameOptions; +import org.apache.beam.sdk.testing.PAssert; +import org.apache.beam.sdk.testing.TestPipeline; +import org.apache.beam.sdk.testing.TestPipelineOptions; +import org.apache.beam.sdk.transforms.Create; +import org.apache.beam.sdk.transforms.SerializableFunction; +import org.apache.beam.sdk.values.PCollection; +import org.junit.Test; +import org.junit.runner.RunWith; +import org.junit.runners.JUnit4; + +/** + * Integration tests for HLL++ sketch compatibility between Beam and BigQuery. The tests verifies + * that HLL++ sketches created in Beam can be processed by BigQuery, and vice versa. + */ +@RunWith(JUnit4.class) +public class BigQueryHllSketchCompatibilityIT { + + private static final String DATASET_NAME = "zetasketch_compatibility_test"; + + // Table for testReadSketchFromBigQuery() + // Schema: only one STRING field named "data". 
+ // Content: prepopulated with 4 rows: "Apple", "Orange", "Banana", "Orange" + private static final String DATA_TABLE_NAME = "hll_data"; + private static final String DATA_FIELD_NAME = "data"; + private static final String QUERY_RESULT_FIELD_NAME = "sketch"; + private static final Long EXPECTED_COUNT = 3L; + + // Table for testWriteSketchToBigQuery() + // Schema: only one BYTES field named "sketch". + // Content: will be overridden by the sketch computed by the test pipeline each time the test runs + private static final String SKETCH_TABLE_NAME = "hll_sketch"; + private static final String SKETCH_FIELD_NAME = "sketch"; + private static final List TEST_DATA = + Arrays.asList("Apple", "Orange", "Banana", "Orange"); + // SHA-1 hash of string "[3]", the string representation of a row that has only one field 3 in it + private static final String EXPECTED_CHECKSUM = "f1e31df9806ce94c5bdbbfff9608324930f4d3f1"; + + /** + * Test that HLL++ sketch computed in BigQuery can be processed by Beam. Hll sketch is computed by + * {@code HLL_COUNT.INIT} in BigQuery and read into Beam; the test verifies that we can run {@link + * HllCount.MergePartial} and {@link HllCount.Extract} on the sketch in Beam to get the correct + * estimated count. + */ + @Test + public void testReadSketchFromBigQuery() { Review comment: https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/testing/BigqueryClient.java is the util class that could help BQ ITs to create/cleanup test data. This is an automated message from the Apache Git Service. To respond to the message,
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=299082=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-299082 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 21/Aug/19 23:52 Start Date: 21/Aug/19 23:52 Worklog Time Spent: 10m Work Description: robinyqiu commented on pull request #9144: [BEAM-7013] Integrating ZetaSketch's HLL++ algorithm with Beam URL: https://github.com/apache/beam/pull/9144#discussion_r316447983 ## File path: sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/BigQueryHllSketchCompatibilityIT.java ## @@ -0,0 +1,145 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.beam.sdk.extensions.zetasketch; + +import com.google.api.services.bigquery.model.TableFieldSchema; +import com.google.api.services.bigquery.model.TableRow; +import com.google.api.services.bigquery.model.TableSchema; +import java.nio.ByteBuffer; +import java.util.Arrays; +import java.util.Collections; +import java.util.List; +import org.apache.beam.sdk.Pipeline; +import org.apache.beam.sdk.coders.ByteArrayCoder; +import org.apache.beam.sdk.extensions.gcp.options.GcpOptions; +import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO; +import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.TypedRead.Method; +import org.apache.beam.sdk.io.gcp.bigquery.SchemaAndRecord; +import org.apache.beam.sdk.io.gcp.testing.BigqueryMatcher; +import org.apache.beam.sdk.options.ApplicationNameOptions; +import org.apache.beam.sdk.testing.PAssert; +import org.apache.beam.sdk.testing.TestPipeline; +import org.apache.beam.sdk.testing.TestPipelineOptions; +import org.apache.beam.sdk.transforms.Create; +import org.apache.beam.sdk.transforms.SerializableFunction; +import org.apache.beam.sdk.values.PCollection; +import org.junit.Test; +import org.junit.runner.RunWith; +import org.junit.runners.JUnit4; + +/** + * Integration tests for HLL++ sketch compatibility between Beam and BigQuery. The tests verifies + * that HLL++ sketches created in Beam can be processed by BigQuery, and vice versa. + */ +@RunWith(JUnit4.class) +public class BigQueryHllSketchCompatibilityIT { + + private static final String DATASET_NAME = "zetasketch_compatibility_test"; + + // Table for testReadSketchFromBigQuery() + // Schema: only one STRING field named "data". 
+ // Content: prepopulated with 4 rows: "Apple", "Orange", "Banana", "Orange" + private static final String DATA_TABLE_NAME = "hll_data"; + private static final String DATA_FIELD_NAME = "data"; + private static final String QUERY_RESULT_FIELD_NAME = "sketch"; + private static final Long EXPECTED_COUNT = 3L; + + // Table for testWriteSketchToBigQuery() + // Schema: only one BYTES field named "sketch". + // Content: will be overridden by the sketch computed by the test pipeline each time the test runs + private static final String SKETCH_TABLE_NAME = "hll_sketch"; + private static final String SKETCH_FIELD_NAME = "sketch"; + private static final List TEST_DATA = + Arrays.asList("Apple", "Orange", "Banana", "Orange"); + // SHA-1 hash of string "[3]", the string representation of a row that has only one field 3 in it + private static final String EXPECTED_CHECKSUM = "f1e31df9806ce94c5bdbbfff9608324930f4d3f1"; + + /** + * Test that HLL++ sketch computed in BigQuery can be processed by Beam. Hll sketch is computed by + * {@code HLL_COUNT.INIT} in BigQuery and read into Beam; the test verifies that we can run {@link + * HllCount.MergePartial} and {@link HllCount.Extract} on the sketch in Beam to get the correct + * estimated count. + */ + @Test + public void testReadSketchFromBigQuery() { Review comment: This is a very good point. I agree with you that ideally a test should be self-contained and not depend on any external resources. The reason why I did that is simply because the other BigQueryIO integration tests under this
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=299074=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-299074 ]

ASF GitHub Bot logged work on BEAM-7013:

Author: ASF GitHub Bot
Created on: 21/Aug/19 23:43
Start Date: 21/Aug/19 23:43
Worklog Time Spent: 10m
Work Description: robinyqiu commented on pull request #9144: [BEAM-7013] Integrating ZetaSketch's HLL++ algorithm with Beam
URL: https://github.com/apache/beam/pull/9144#discussion_r316446092

## File path: sdks/java/extensions/zetasketch/build.gradle ##
@@ -0,0 +1,64 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * License); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+import groovy.json.JsonOutput
+
+plugins { id 'org.apache.beam.module' }
+applyJavaNature()
+
+description = "Apache Beam :: SDKs :: Java :: Extensions :: ZetaSketch"
+
+def zetasketch_version = "0.1.0"
+
+dependencies {
+  compile library.java.vendored_guava_26_0_jre
+  compile project(path: ":sdks:java:core", configuration: "shadow")
+  compile "com.google.zetasketch:zetasketch:$zetasketch_version"
+  testCompile library.java.junit
+  testCompile project(":sdks:java:io:google-cloud-platform")
+  testRuntimeOnly project(":runners:direct-java")
+  testRuntimeOnly project(":runners:google-cloud-dataflow-java")
+}
+
+/**
+ * Integration tests running on Dataflow with BigQuery.
+ */
+task integrationTest(type: Test) {
+  group = "Verification"
+  def gcpProject = project.findProperty('gcpProject') ?: 'apache-beam-testing'
+  def gcpTempRoot = project.findProperty('gcpTempRoot') ?: 'gs://temp-storage-for-end-to-end-tests'
+  systemProperty "beamTestPipelineOptions", JsonOutput.toJson([
+      "--runner=TestDataflowRunner",
+      "--project=${gcpProject}",
+      "--tempRoot=${gcpTempRoot}",
+  ])

Review comment:
Thanks for the pointer. Will do.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Issue Time Tracking
---

Worklog Id: (was: 299074)
Time Spent: 26h (was: 25h 50m)

> A new count distinct transform based on BigQuery compatible HyperLogLog++
> implementation
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
> Issue Type: New Feature
> Components: extensions-java-sketching, sdk-java-core
> Reporter: Yueyang Qiu
> Assignee: Yueyang Qiu
> Priority: Major
> Fix For: 2.16.0
>
> Time Spent: 26h
> Remaining Estimate: 0h

--
This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=299073=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-299073 ]

ASF GitHub Bot logged work on BEAM-7013:

Author: ASF GitHub Bot
Created on: 21/Aug/19 23:33
Start Date: 21/Aug/19 23:33
Worklog Time Spent: 10m
Work Description: robinyqiu commented on pull request #9144: [BEAM-7013] Integrating ZetaSketch's HLL++ algorithm with Beam
URL: https://github.com/apache/beam/pull/9144#discussion_r316444192

## File path: sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/testing/BigqueryMatcher.java ##
@@ -63,32 +63,52 @@
   private final String projectId;
   private final String query;
+  private final boolean usingStandardSql;

Review comment:
I have considered this. But in our Beam `BigQueryIO` source we also have https://github.com/apache/beam/blob/08d0146791e38be4641ff80ffb2539cdc81f5b6d/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L600

Since this is a public API visible to Beam users but the BQ client is not, I decided to be consistent with the former (and therefore I have to negate the boolean somewhere in the function call stack).

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Issue Time Tracking
---

Worklog Id: (was: 299073)
Time Spent: 25h 50m (was: 25h 40m)

> A new count distinct transform based on BigQuery compatible HyperLogLog++
> implementation
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
> Issue Type: New Feature
> Components: extensions-java-sketching, sdk-java-core
> Reporter: Yueyang Qiu
> Assignee: Yueyang Qiu
> Priority: Major
> Fix For: 2.16.0
>
> Time Spent: 25h 50m
> Remaining Estimate: 0h

--
This message was sent by Atlassian Jira (v8.3.2#803003)
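The negation described in the review comment above can be sketched as follows. This is only a minimal illustration, not the code from the pull request: `SqlDialectFlagSketch`, `QueryRequestStub`, and `buildRequest` are hypothetical names, and the stub merely assumes that the client-side request exposes a legacy-SQL flag, as the discussion implies.

```java
// Hypothetical sketch: the test helper keeps a user-facing "usingStandardSql"
// flag and negates it once, at the boundary where the client request is built.
final class SqlDialectFlagSketch {

  /** Stand-in for a client request object that only knows about legacy SQL. */
  static final class QueryRequestStub {
    private boolean useLegacySql = true;

    QueryRequestStub setUseLegacySql(boolean useLegacySql) {
      this.useLegacySql = useLegacySql;
      return this;
    }

    boolean getUseLegacySql() {
      return useLegacySql;
    }
  }

  /** Translates the standard-SQL flag into the legacy-SQL flag in one place. */
  static QueryRequestStub buildRequest(boolean usingStandardSql) {
    return new QueryRequestStub().setUseLegacySql(!usingStandardSql);
  }

  public static void main(String[] args) {
    System.out.println(buildRequest(true).getUseLegacySql());
    System.out.println(buildRequest(false).getUseLegacySql());
  }
}
```

Keeping the negation in a single method means callers only ever deal with one of the two flags, which is the consistency concern raised in the thread.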
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=299003=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-299003 ]

ASF GitHub Bot logged work on BEAM-7013:

Author: ASF GitHub Bot
Created on: 21/Aug/19 20:50
Start Date: 21/Aug/19 20:50
Worklog Time Spent: 10m
Work Description: boyuanzz commented on pull request #9144: [BEAM-7013] Integrating ZetaSketch's HLL++ algorithm with Beam
URL: https://github.com/apache/beam/pull/9144#discussion_r316396289

## File path: sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/testing/BigqueryMatcher.java ##
@@ -63,32 +63,52 @@
   private final String projectId;
   private final String query;
+  private final boolean usingStandardSql;

Review comment:
Thanks for the pointer! Then maybe we should set up `usingLegacySql` instead of `usingStandardSql` to stay consistent with the bq model. wdyt?

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Issue Time Tracking
---

Worklog Id: (was: 299003)
Time Spent: 25h 40m (was: 25.5h)

> A new count distinct transform based on BigQuery compatible HyperLogLog++
> implementation
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
> Issue Type: New Feature
> Components: extensions-java-sketching, sdk-java-core
> Reporter: Yueyang Qiu
> Assignee: Yueyang Qiu
> Priority: Major
> Fix For: 2.16.0
>
> Time Spent: 25h 40m
> Remaining Estimate: 0h

--
This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=299000=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-299000 ]

ASF GitHub Bot logged work on BEAM-7013:

Author: ASF GitHub Bot
Created on: 21/Aug/19 20:46
Start Date: 21/Aug/19 20:46
Worklog Time Spent: 10m
Work Description: boyuanzz commented on pull request #9144: [BEAM-7013] Integrating ZetaSketch's HLL++ algorithm with Beam
URL: https://github.com/apache/beam/pull/9144#discussion_r316394576

## File path: sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/BigQueryHllSketchCompatibilityIT.java ##
@@ -0,0 +1,145 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.beam.sdk.extensions.zetasketch;
+
+import com.google.api.services.bigquery.model.TableFieldSchema;
+import com.google.api.services.bigquery.model.TableRow;
+import com.google.api.services.bigquery.model.TableSchema;
+import java.nio.ByteBuffer;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+import org.apache.beam.sdk.Pipeline;
+import org.apache.beam.sdk.coders.ByteArrayCoder;
+import org.apache.beam.sdk.extensions.gcp.options.GcpOptions;
+import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
+import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.TypedRead.Method;
+import org.apache.beam.sdk.io.gcp.bigquery.SchemaAndRecord;
+import org.apache.beam.sdk.io.gcp.testing.BigqueryMatcher;
+import org.apache.beam.sdk.options.ApplicationNameOptions;
+import org.apache.beam.sdk.testing.PAssert;
+import org.apache.beam.sdk.testing.TestPipeline;
+import org.apache.beam.sdk.testing.TestPipelineOptions;
+import org.apache.beam.sdk.transforms.Create;
+import org.apache.beam.sdk.transforms.SerializableFunction;
+import org.apache.beam.sdk.values.PCollection;
+import org.junit.Test;
+import org.junit.runner.RunWith;
+import org.junit.runners.JUnit4;
+
+/**
+ * Integration tests for HLL++ sketch compatibility between Beam and BigQuery. The tests verify
+ * that HLL++ sketches created in Beam can be processed by BigQuery, and vice versa.
+ */
+@RunWith(JUnit4.class)
+public class BigQueryHllSketchCompatibilityIT {
+
+  private static final String DATASET_NAME = "zetasketch_compatibility_test";
+
+  // Table for testReadSketchFromBigQuery()
+  // Schema: only one STRING field named "data".
+  // Content: prepopulated with 4 rows: "Apple", "Orange", "Banana", "Orange"
+  private static final String DATA_TABLE_NAME = "hll_data";
+  private static final String DATA_FIELD_NAME = "data";
+  private static final String QUERY_RESULT_FIELD_NAME = "sketch";
+  private static final Long EXPECTED_COUNT = 3L;
+
+  // Table for testWriteSketchToBigQuery()
+  // Schema: only one BYTES field named "sketch".
+  // Content: will be overridden by the sketch computed by the test pipeline each time the test runs
+  private static final String SKETCH_TABLE_NAME = "hll_sketch";
+  private static final String SKETCH_FIELD_NAME = "sketch";
+  private static final List<String> TEST_DATA =
+      Arrays.asList("Apple", "Orange", "Banana", "Orange");
+  // SHA-1 hash of string "[3]", the string representation of a row that has only one field 3 in it
+  private static final String EXPECTED_CHECKSUM = "f1e31df9806ce94c5bdbbfff9608324930f4d3f1";
+
+  /**
+   * Test that HLL++ sketch computed in BigQuery can be processed by Beam. Hll sketch is computed by
+   * {@code HLL_COUNT.INIT} in BigQuery and read into Beam; the test verifies that we can run {@link
+   * HllCount.MergePartial} and {@link HllCount.Extract} on the sketch in Beam to get the correct
+   * estimated count.
+   */
+  @Test
+  public void testReadSketchFromBigQuery() {

Review comment:
Any reason for choosing to create the test data manually? IMO, it would make operations harder in certain scenarios. For example, if our infra team decides to use a project to run all ITs, then your test will be broken. Instead, how about creating your test data in `@BeforeClass` and deleting all data in `@AfterClass`?
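For readers wondering how a constant like `EXPECTED_CHECKSUM` could be derived, the code comment in the diff describes it as the SHA-1 hash of the string "[3]". A minimal sketch of that computation with the JDK's `MessageDigest` follows; `RowChecksumSketch` and `sha1Hex` are hypothetical names, and whether `BigqueryMatcher` hashes rows in exactly this way is an assumption not confirmed by this thread.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: hex-encoded SHA-1 of a row's string representation.
final class RowChecksumSketch {

  static String sha1Hex(String text) {
    try {
      MessageDigest digest = MessageDigest.getInstance("SHA-1");
      byte[] hash = digest.digest(text.getBytes(StandardCharsets.UTF_8));
      StringBuilder hex = new StringBuilder(hash.length * 2);
      for (byte b : hash) {
        // Mask to an unsigned value so each byte prints as two lowercase hex digits.
        hex.append(String.format("%02x", b & 0xff));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      // SHA-1 is one of the algorithms every Java platform must provide.
      throw new IllegalStateException("SHA-1 not available", e);
    }
  }

  public static void main(String[] args) {
    // Prints the digest of "[3]" so it can be compared with the constant in the test.
    System.out.println(sha1Hex("[3]"));
  }
}
```

Running `main` and comparing the printed value with the constant in the diff is one way to check the assumption about how the checksum was produced.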
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=298996=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-298996 ]

ASF GitHub Bot logged work on BEAM-7013:

Author: ASF GitHub Bot
Created on: 21/Aug/19 20:40
Start Date: 21/Aug/19 20:40
Worklog Time Spent: 10m
Work Description: boyuanzz commented on pull request #9144: [BEAM-7013] Integrating ZetaSketch's HLL++ algorithm with Beam
URL: https://github.com/apache/beam/pull/9144#discussion_r316392094

## File path: sdks/java/extensions/zetasketch/build.gradle ##
@@ -0,0 +1,64 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * License); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+import groovy.json.JsonOutput
+
+plugins { id 'org.apache.beam.module' }
+applyJavaNature()
+
+description = "Apache Beam :: SDKs :: Java :: Extensions :: ZetaSketch"
+
+def zetasketch_version = "0.1.0"
+
+dependencies {
+  compile library.java.vendored_guava_26_0_jre
+  compile project(path: ":sdks:java:core", configuration: "shadow")
+  compile "com.google.zetasketch:zetasketch:$zetasketch_version"
+  testCompile library.java.junit
+  testCompile project(":sdks:java:io:google-cloud-platform")
+  testRuntimeOnly project(":runners:direct-java")
+  testRuntimeOnly project(":runners:google-cloud-dataflow-java")
+}
+
+/**
+ * Integration tests running on Dataflow with BigQuery.
+ */
+task integrationTest(type: Test) {
+  group = "Verification"
+  def gcpProject = project.findProperty('gcpProject') ?: 'apache-beam-testing'
+  def gcpTempRoot = project.findProperty('gcpTempRoot') ?: 'gs://temp-storage-for-end-to-end-tests'
+  systemProperty "beamTestPipelineOptions", JsonOutput.toJson([
+      "--runner=TestDataflowRunner",
+      "--project=${gcpProject}",
+      "--tempRoot=${gcpTempRoot}",
+  ])

Review comment:
But you want to run our test with DataflowRunner right? Then if you want to always run with the worker head, you need to add 2 more cmd args like: https://github.com/apache/beam/blob/master/runners/google-cloud-dataflow-java/build.gradle#L161. Otherwise, the prebuilt worker image is used.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Issue Time Tracking
---

Worklog Id: (was: 298996)
Time Spent: 25h 20m (was: 25h 10m)

> A new count distinct transform based on BigQuery compatible HyperLogLog++
> implementation
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
> Issue Type: New Feature
> Components: extensions-java-sketching, sdk-java-core
> Reporter: Yueyang Qiu
> Assignee: Yueyang Qiu
> Priority: Major
> Fix For: 2.16.0
>
> Time Spent: 25h 20m
> Remaining Estimate: 0h

--
This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=298287=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-298287 ]

ASF GitHub Bot logged work on BEAM-7013:

Author: ASF GitHub Bot
Created on: 20/Aug/19 22:51
Start Date: 20/Aug/19 22:51
Worklog Time Spent: 10m
Work Description: robinyqiu commented on issue #9144: [BEAM-7013] Integrating ZetaSketch's HLL++ algorithm with Beam
URL: https://github.com/apache/beam/pull/9144#issuecomment-523225422

Hi, @boyuanzz and @zfraa. Thanks again for your review! I have answered your questions and made necessary changes to the code. PTAL.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Issue Time Tracking
---

Worklog Id: (was: 298287)
Time Spent: 25h 10m (was: 25h)

> A new count distinct transform based on BigQuery compatible HyperLogLog++
> implementation
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
> Issue Type: New Feature
> Components: extensions-java-sketching, sdk-java-core
> Reporter: Yueyang Qiu
> Assignee: Yueyang Qiu
> Priority: Major
> Fix For: 2.16.0
>
> Time Spent: 25h 10m
> Remaining Estimate: 0h

--
This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=298280=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-298280 ]

ASF GitHub Bot logged work on BEAM-7013:

Author: ASF GitHub Bot
Created on: 20/Aug/19 22:48
Start Date: 20/Aug/19 22:48
Worklog Time Spent: 10m
Work Description: robinyqiu commented on pull request #9144: [BEAM-7013] Integrating ZetaSketch's HLL++ algorithm with Beam
URL: https://github.com/apache/beam/pull/9144#discussion_r315901106

## File path: sdks/java/extensions/zetasketch/build.gradle ##
@@ -0,0 +1,64 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * License); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+import groovy.json.JsonOutput
+

Review comment:
Interesting. I looked into the usage of `evaluationDependsOn`. The comments in the link you shared, https://github.com/apache/beam/blob/ff7964c7252c8a0c670f69bb4291230ca6136afd/runners/direct-java/build.gradle#L44-L48, say it is only needed because that script directly references the `sourceSets.test.output` of another project: https://github.com/apache/beam/blob/ff7964c7252c8a0c670f69bb4291230ca6136afd/runners/direct-java/build.gradle#L102-L104

I searched throughout the codebase, and that seems to be the only reason people add `evaluationDependsOn` to their build scripts.

However, directly referencing `sourceSets.test.output` is discouraged. Quote from https://discuss.gradle.org/t/evaluationdependson-annoyance/21783/2:

> While you can reference tasks from other projects, it really should be a last resort. Using evaluationDependsOn is a bit of a smell. Dependencies, configurations, and artifacts are a better model than directly accessing outputs of a task from another project.

So I would prefer not to add it to this script, because we are not doing any direct referencing here.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Issue Time Tracking
---

Worklog Id: (was: 298280)
Time Spent: 24h 50m (was: 24h 40m)

> A new count distinct transform based on BigQuery compatible HyperLogLog++
> implementation
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
> Issue Type: New Feature
> Components: extensions-java-sketching, sdk-java-core
> Reporter: Yueyang Qiu
> Assignee: Yueyang Qiu
> Priority: Major
> Fix For: 2.16.0
>
> Time Spent: 24h 50m
> Remaining Estimate: 0h

--
This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=298275&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-298275 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 20/Aug/19 22:45 Start Date: 20/Aug/19 22:45 Worklog Time Spent: 10m Work Description: robinyqiu commented on pull request #9144: [BEAM-7013] Integrating ZetaSketch's HLL++ algorithm with Beam URL: https://github.com/apache/beam/pull/9144#discussion_r315901106 ## File path: sdks/java/extensions/zetasketch/build.gradle ## Issue Time Tracking --- Worklog Id: (was: 298275) Time Spent: 24h 10m (was: 24h)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=298186&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-298186 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 20/Aug/19 21:00 Start Date: 20/Aug/19 21:00 Worklog Time Spent: 10m Work Description: robinyqiu commented on pull request #9144: [BEAM-7013] Integrating ZetaSketch's HLL++ algorithm with Beam URL: https://github.com/apache/beam/pull/9144#discussion_r315901106 ## File path: sdks/java/extensions/zetasketch/build.gradle ## Issue Time Tracking --- Worklog Id: (was: 298186) Time Spent: 24h (was: 23h 50m)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=298178&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-298178 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 20/Aug/19 20:51 Start Date: 20/Aug/19 20:51 Worklog Time Spent: 10m Work Description: robinyqiu commented on pull request #9144: [BEAM-7013] Integrating ZetaSketch's HLL++ algorithm with Beam URL: https://github.com/apache/beam/pull/9144#discussion_r315874860 ## File path: sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/HllCountTest.java ## @@ -0,0 +1,373 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.beam.sdk.extensions.zetasketch; + +import com.google.zetasketch.HyperLogLogPlusPlus; +import com.google.zetasketch.shaded.com.google.protobuf.ByteString; +import java.util.ArrayList; +import java.util.Arrays; +import java.util.Collections; +import java.util.List; +import org.apache.beam.sdk.Pipeline.PipelineExecutionException; +import org.apache.beam.sdk.testing.NeedsRunner; +import org.apache.beam.sdk.testing.PAssert; +import org.apache.beam.sdk.testing.TestPipeline; +import org.apache.beam.sdk.transforms.Create; +import org.apache.beam.sdk.values.KV; +import org.apache.beam.sdk.values.PCollection; +import org.apache.beam.sdk.values.TypeDescriptor; +import org.junit.Rule; +import org.junit.Test; +import org.junit.experimental.categories.Category; +import org.junit.rules.ExpectedException; +import org.junit.runner.RunWith; +import org.junit.runners.JUnit4; + +/** Tests for {@link HllCount}. */ +@RunWith(JUnit4.class) +public class HllCountTest { + + @Rule public final transient TestPipeline p = TestPipeline.create(); + @Rule public transient ExpectedException thrown = ExpectedException.none(); + + // Integer + private static final List INTS1 = Arrays.asList(1, 2, 3, 3, 1, 4); + private static final byte[] INTS1_SKETCH; + private static final Long INTS1_ESTIMATE; + + static { +HyperLogLogPlusPlus hll = new HyperLogLogPlusPlus.Builder().buildForIntegers(); +INTS1.forEach(hll::add); +INTS1_SKETCH = hll.serializeToByteArray(); +INTS1_ESTIMATE = hll.longResult(); + } + + private static final List INTS2 = Arrays.asList(3, 3, 3, 3); + private static final byte[] INTS2_SKETCH; + private static final Long INTS2_ESTIMATE; + + static { +HyperLogLogPlusPlus hll = new HyperLogLogPlusPlus.Builder().buildForIntegers(); +INTS2.forEach(hll::add); +INTS2_SKETCH = hll.serializeToByteArray(); +INTS2_ESTIMATE = hll.longResult(); + } + + private static final byte[] INTS1_INTS2_SKETCH; + + static { +HyperLogLogPlusPlus hll = 
HyperLogLogPlusPlus.forProto(INTS1_SKETCH); +hll.merge(INTS2_SKETCH); +INTS1_INTS2_SKETCH = hll.serializeToByteArray(); + } + + // Long + private static final List LONGS = Collections.singletonList(1L); + private static final byte[] LONGS_SKETCH; + + static { +HyperLogLogPlusPlus hll = new HyperLogLogPlusPlus.Builder().buildForLongs(); +LONGS.forEach(hll::add); +LONGS_SKETCH = hll.serializeToByteArray(); + } + + private static final byte[] LONGS_EMPTY_SKETCH; + + static { +HyperLogLogPlusPlus hll = new HyperLogLogPlusPlus.Builder().buildForLongs(); +LONGS_EMPTY_SKETCH = hll.serializeToByteArray(); + } + + // String + private static final List STRINGS = Arrays.asList("s1", "s2", "s1", "s2"); + private static final byte[] STRINGS_SKETCH; + + static { +HyperLogLogPlusPlus hll = new HyperLogLogPlusPlus.Builder().buildForStrings(); +STRINGS.forEach(hll::add); +STRINGS_SKETCH = hll.serializeToByteArray(); + } + + private static final int TEST_PRECISION = 20; + private static final byte[] STRINGS_SKETCH_TEST_PRECISION; + + static { +HyperLogLogPlusPlus hll = +new HyperLogLogPlusPlus.Builder().normalPrecision(TEST_PRECISION).buildForStrings(); +STRINGS.forEach(hll::add); +STRINGS_SKETCH_TEST_PRECISION = hll.serializeToByteArray(); + } + + // Bytes + private static final byte[] BYTES0 = {(byte) 0x1, (byte)
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=298177&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-298177 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 20/Aug/19 20:50 Start Date: 20/Aug/19 20:50 Worklog Time Spent: 10m Work Description: robinyqiu commented on pull request #9144: [BEAM-7013] Integrating ZetaSketch's HLL++ algorithm with Beam URL: https://github.com/apache/beam/pull/9144#discussion_r315874860 ## File path: sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/HllCountTest.java ##
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=298174&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-298174 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 20/Aug/19 20:49 Start Date: 20/Aug/19 20:49 Worklog Time Spent: 10m Work Description: robinyqiu commented on pull request #9144: [BEAM-7013] Integrating ZetaSketch's HLL++ algorithm with Beam URL: https://github.com/apache/beam/pull/9144#discussion_r315874860 ## File path: sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/HllCountTest.java ##
[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
[ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=298158&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-298158 ] ASF GitHub Bot logged work on BEAM-7013: Author: ASF GitHub Bot Created on: 20/Aug/19 20:40 Start Date: 20/Aug/19 20:40 Worklog Time Spent: 10m Work Description: robinyqiu commented on pull request #9144: [BEAM-7013] Integrating ZetaSketch's HLL++ algorithm with Beam URL: https://github.com/apache/beam/pull/9144#discussion_r315892837 ## File path: sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/HllCountTest.java ## @@ -0,0 +1,373 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.beam.sdk.extensions.zetasketch; + +import com.google.zetasketch.HyperLogLogPlusPlus; +import com.google.zetasketch.shaded.com.google.protobuf.ByteString; +import java.util.ArrayList; +import java.util.Arrays; +import java.util.Collections; +import java.util.List; +import org.apache.beam.sdk.Pipeline.PipelineExecutionException; +import org.apache.beam.sdk.testing.NeedsRunner; +import org.apache.beam.sdk.testing.PAssert; +import org.apache.beam.sdk.testing.TestPipeline; +import org.apache.beam.sdk.transforms.Create; +import org.apache.beam.sdk.values.KV; +import org.apache.beam.sdk.values.PCollection; +import org.apache.beam.sdk.values.TypeDescriptor; +import org.junit.Rule; +import org.junit.Test; +import org.junit.experimental.categories.Category; +import org.junit.rules.ExpectedException; +import org.junit.runner.RunWith; +import org.junit.runners.JUnit4; + +/** Tests for {@link HllCount}. */ +@RunWith(JUnit4.class) Review comment: It is included. Locally you can run `./gradlew :sdks:java:extensions:zetasketch:test` to execute the tests. On Jenkins it is included as Java PreCommit test: https://builds.apache.org/job/beam_PreCommit_Java_Commit/7374/testReport/org.apache.beam.sdk.extensions.zetasketch/ This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 298158) Time Spent: 23h 20m (was: 23h 10m) > A new count distinct transform based on BigQuery compatible HyperLogLog++ > implementation > > > Key: BEAM-7013 > URL: https://issues.apache.org/jira/browse/BEAM-7013 > Project: Beam > Issue Type: New Feature > Components: extensions-java-sketching, sdk-java-core >Reporter: Yueyang Qiu >Assignee: Yueyang Qiu >Priority: Major > Fix For: 2.16.0 > > Time Spent: 23h 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.2#803003)