[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-11-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=341438&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-341438
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 11/Nov/19 19:11
Start Date: 11/Nov/19 19:11
Worklog Time Spent: 10m 
  Work Description: boyuanzz commented on pull request #9778: [BEAM-7013] 
Update BigQueryHllSketchCompatibilityIT to cover empty sketch cases
URL: https://github.com/apache/beam/pull/9778
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 341438)
Time Spent: 37h 20m  (was: 37h 10m)

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
>  Issue Type: New Feature
>  Components: extensions-java-sketching, sdk-java-core
>Reporter: Yueyang Qiu
>Assignee: Yueyang Qiu
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 37h 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-11-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=339648&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339648
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 07/Nov/19 00:29
Start Date: 07/Nov/19 00:29
Worklog Time Spent: 10m 
  Work Description: robinyqiu commented on issue #9778: [BEAM-7013] Update 
BigQueryHllSketchCompatibilityIT to cover empty sketch cases
URL: https://github.com/apache/beam/pull/9778#issuecomment-550564200
 
 
   Run Java PostCommit
 



Issue Time Tracking
---

Worklog Id: (was: 339648)
Time Spent: 37h 10m  (was: 37h)



[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-11-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=339542&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339542
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 06/Nov/19 18:47
Start Date: 06/Nov/19 18:47
Worklog Time Spent: 10m 
  Work Description: robinyqiu commented on issue #9778: [BEAM-7013] Update 
BigQueryHllSketchCompatibilityIT to cover empty sketch cases
URL: https://github.com/apache/beam/pull/9778#issuecomment-550447746
 
 
   Run Java_Examples_Dataflow PreCommit
 



Issue Time Tracking
---

Worklog Id: (was: 339542)
Time Spent: 37h  (was: 36h 50m)



[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-11-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=339541&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-339541
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 06/Nov/19 18:47
Start Date: 06/Nov/19 18:47
Worklog Time Spent: 10m 
  Work Description: robinyqiu commented on issue #9778: [BEAM-7013] Update 
BigQueryHllSketchCompatibilityIT to cover empty sketch cases
URL: https://github.com/apache/beam/pull/9778#issuecomment-550447688
 
 
   Run Java PreCommit
 



Issue Time Tracking
---

Worklog Id: (was: 339541)
Time Spent: 36h 50m  (was: 36h 40m)



[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-11-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=338301&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-338301
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 04/Nov/19 18:45
Start Date: 04/Nov/19 18:45
Worklog Time Spent: 10m 
  Work Description: robinyqiu commented on pull request #9778: [BEAM-7013] 
Update BigQueryHllSketchCompatibilityIT to cover empty sketch cases
URL: https://github.com/apache/beam/pull/9778#discussion_r342208431
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCount.java
 ##
 @@ -107,6 +109,20 @@
   // Cannot be instantiated. This class is intended to be a namespace only.
   private HllCount() {}
 
+  /**
+   * Returns the sketch stored as bytes in the input {@code ByteBuffer}. If the input {@code
 
 Review comment:
   Done. Agreed, this does sound clearer!
 



Issue Time Tracking
---

Worklog Id: (was: 338301)
Time Spent: 36h 40m  (was: 36.5h)



[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-11-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=338241&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-338241
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 04/Nov/19 17:44
Start Date: 04/Nov/19 17:44
Worklog Time Spent: 10m 
  Work Description: zfraa commented on pull request #9778: [BEAM-7013] 
Update BigQueryHllSketchCompatibilityIT to cover empty sketch cases
URL: https://github.com/apache/beam/pull/9778#discussion_r342170633
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCount.java
 ##
 @@ -107,6 +109,20 @@
   // Cannot be instantiated. This class is intended to be a namespace only.
   private HllCount() {}
 
+  /**
+   * Returns the sketch stored as bytes in the input {@code ByteBuffer}. If the input {@code
 
 Review comment:
   I was confused here at first: if you are not aware that this is a static method, you can read it as "the user passes in an empty/reusable ByteBuffer and the library returns the serialized sketch in that ByteBuffer".
   To keep not-so-alert readers from thinking that ;), how about
   "Converts the passed-in sketch from ByteBuffer to byte[], mapping null ByteBuffers (representing empty sketches) to empty byte[]. Utility method to convert sketches materialized with ZetaSQL/BigQuery to valid input for HllCount transforms."?
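   The conversion described in the proposed wording can be sketched roughly as follows. This is only an illustration under stated assumptions: the class name `SketchBytes` and method name `toByteArray` are hypothetical and are not the names used in the pull request, and reading through `duplicate()` is a choice made here so that the caller's buffer position is left untouched.

```java
import java.nio.ByteBuffer;

/** Hypothetical helper illustrating the ByteBuffer-to-byte[] conversion discussed above. */
final class SketchBytes {
  private SketchBytes() {}

  /**
   * Converts a sketch held in a ByteBuffer to byte[]. A null buffer stands for an
   * empty sketch and is mapped to an empty byte array.
   */
  static byte[] toByteArray(ByteBuffer sketch) {
    if (sketch == null) {
      return new byte[0];
    }
    byte[] result = new byte[sketch.remaining()];
    // Read through a duplicate so the position of the caller's buffer does not move.
    sketch.duplicate().get(result);
    return result;
  }

  public static void main(String[] args) {
    System.out.println(toByteArray(null).length); // 0
    System.out.println(toByteArray(ByteBuffer.wrap(new byte[] {1, 2, 3})).length); // 3
  }
}
```

   The method is static and allocates a fresh array on every call, which matches the reading that the argument is an input holding a sketch rather than an output buffer to be filled.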
 



Issue Time Tracking
---

Worklog Id: (was: 338241)
Time Spent: 36.5h  (was: 36h 20m)



[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-11-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=337612&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-337612
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 01/Nov/19 23:16
Start Date: 01/Nov/19 23:16
Worklog Time Spent: 10m 
  Work Description: robinyqiu commented on issue #9778: [BEAM-7013] Update 
BigQueryHllSketchCompatibilityIT to cover empty sketch cases
URL: https://github.com/apache/beam/pull/9778#issuecomment-548981025
 
 
   > is there any way we could make the conversion functors 
(parseQueryResultToByteArray, and the inlined one for the other direction) 
available to users as utility objects or methods?
   
   I considered this, but unfortunately we cannot do that, because users might want to parse other fields in the same function. However, I did find a way to extract part of the logic into its own function: see `getSketchFromByteBuffer`.
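   To illustrate why a ready-made parse function would not fit every caller, here is a minimal sketch in which the user's own parse function reads a second field alongside the sketch and delegates only the sketch conversion to a helper. Everything in it is an assumption made for illustration: a plain `Map` stands in for the query-result record, the `country` field and the `Parsed` and `ParseRowSketch` classes are hypothetical, and `sketchBytes` is a stand-in for the extracted helper rather than its actual implementation.

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

final class ParseRowSketch {
  /** Hypothetical result type carrying a non-sketch field next to the sketch. */
  static final class Parsed {
    final String country;
    final byte[] sketch;

    Parsed(String country, byte[] sketch) {
      this.country = country;
      this.sketch = sketch;
    }
  }

  /** Stand-in for the extracted helper: null (empty sketch) becomes an empty byte array. */
  static byte[] sketchBytes(ByteBuffer buffer) {
    if (buffer == null) {
      return new byte[0];
    }
    byte[] result = new byte[buffer.remaining()];
    buffer.duplicate().get(result);
    return result;
  }

  /** User-written parse function: it reads another field in the same pass. */
  static Parsed parse(Map<String, Object> record) {
    String country = (String) record.get("country");
    byte[] sketch = sketchBytes((ByteBuffer) record.get("sketch"));
    return new Parsed(country, sketch);
  }

  public static void main(String[] args) {
    Map<String, Object> record = new HashMap<>();
    record.put("country", "NZ");
    record.put("sketch", null); // an empty sketch arrives as null
    Parsed parsed = parse(record);
    System.out.println(parsed.country + " " + parsed.sketch.length);
  }
}
```

   Only the null handling and byte copying are shared; which fields to read and how to combine them stays with the caller.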
 



Issue Time Tracking
---

Worklog Id: (was: 337612)
Time Spent: 36h 20m  (was: 36h 10m)



[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-10-31 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=337122&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-337122
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 01/Nov/19 00:45
Start Date: 01/Nov/19 00:45
Worklog Time Spent: 10m 
  Work Description: robinyqiu commented on pull request #9778: [BEAM-7013] 
Update BigQueryHllSketchCompatibilityIT to cover empty sketch cases
URL: https://github.com/apache/beam/pull/9778#discussion_r341417808
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/BigQueryHllSketchCompatibilityIT.java
 ##
 @@ -65,23 +66,32 @@
   private static final List<String> TEST_DATA =
       Arrays.asList("Apple", "Orange", "Banana", "Orange");
 
-  // Data Table: used by testReadSketchFromBigQuery())
+  // Data Table: used by tests reading sketches from BigQuery
   // Schema: only one STRING field named "data".
-  // Content: prepopulated with 4 rows: "Apple", "Orange", "Banana", "Orange"
-  private static final String DATA_TABLE_ID = "hll_data";
   private static final String DATA_FIELD_NAME = "data";
   private static final String DATA_FIELD_TYPE = "STRING";
   private static final String QUERY_RESULT_FIELD_NAME = "sketch";
-  private static final Long EXPECTED_COUNT = 3L;
 
-  // Sketch Table: used by testWriteSketchToBigQuery()
+  // Content: prepopulated with 4 rows: "Apple", "Orange", "Banana", "Orange"
+  private static final String DATA_TABLE_ID_NON_EMPTY = "hll_data_non_empty";
+  private static final Long EXPECTED_COUNT_NON_EMPTY = 3L;
+
+  // Content: empty
+  private static final String DATA_TABLE_ID_EMPTY = "hll_data_empty";
 
 Review comment:
   Yes, it does, and I have tried it (although it is not mentioned in the BigQuery documentation).
 



Issue Time Tracking
---

Worklog Id: (was: 337122)
Time Spent: 36h 10m  (was: 36h)



[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-10-31 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=337118&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-337118
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 01/Nov/19 00:29
Start Date: 01/Nov/19 00:29
Worklog Time Spent: 10m 
  Work Description: robinyqiu commented on pull request #9778: [BEAM-7013] 
Update BigQueryHllSketchCompatibilityIT to cover empty sketch cases
URL: https://github.com/apache/beam/pull/9778#discussion_r341415540
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/BigQueryHllSketchCompatibilityIT.java
 ##
 @@ -126,22 +145,49 @@ public static void deleteDataset() throws Exception {
   }
 
   /**
-   * Test that HLL++ sketch computed in BigQuery can be processed by Beam. Hll sketch is computed by
-   * {@code HLL_COUNT.INIT} in BigQuery and read into Beam; the test verifies that we can run {@link
-   * HllCount.MergePartial} and {@link HllCount.Extract} on the sketch in Beam to get the correct
-   * estimated count.
+   * Test that non-empty HLL++ sketch computed in BigQuery can be processed by Beam.
+   *
+   * Hll sketch is computed by {@code HLL_COUNT.INIT} in BigQuery and read into Beam; the test
 
 Review comment:
   Fixed! Thanks for the link, that's a good read. :)
 



Issue Time Tracking
---

Worklog Id: (was: 337118)
Time Spent: 36h  (was: 35h 50m)



[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-10-31 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=337110&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-337110
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 01/Nov/19 00:23
Start Date: 01/Nov/19 00:23
Worklog Time Spent: 10m 
  Work Description: robinyqiu commented on pull request #9778: [BEAM-7013] 
Update BigQueryHllSketchCompatibilityIT to cover empty sketch cases
URL: https://github.com/apache/beam/pull/9778#discussion_r341414587
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/BigQueryHllSketchCompatibilityIT.java
 ##
 @@ -126,22 +145,49 @@ public static void deleteDataset() throws Exception {
   }
 
   /**
-   * Test that HLL++ sketch computed in BigQuery can be processed by Beam. Hll sketch is computed by
-   * {@code HLL_COUNT.INIT} in BigQuery and read into Beam; the test verifies that we can run {@link
-   * HllCount.MergePartial} and {@link HllCount.Extract} on the sketch in Beam to get the correct
-   * estimated count.
+   * Test that non-empty HLL++ sketch computed in BigQuery can be processed by Beam.
+   *
+   * Hll sketch is computed by {@code HLL_COUNT.INIT} in BigQuery and read into Beam; the test
+   * verifies that we can run {@link HllCount.MergePartial} and {@link HllCount.Extract} on the
+   * sketch in Beam to get the correct estimated count.
+   */
+  @Test
+  public void testReadNonEmptySketchFromBigQuery() {
+    readSketchFromBigQuery(DATA_TABLE_ID_NON_EMPTY, EXPECTED_COUNT_NON_EMPTY);
+  }
+
+  /**
+   * Test that empty HLL++ sketch computed in BigQuery can be processed by Beam.
+   *
+   * Hll sketch is computed by {@code HLL_COUNT.INIT} in BigQuery and read into Beam; the test
+   * verifies that we can run {@link HllCount.MergePartial} and {@link HllCount.Extract} on the
+   * sketch in Beam to get the correct estimated count.
    */
   @Test
-  public void testReadSketchFromBigQuery() {
-    String tableSpec = String.format("%s.%s", DATASET_ID, DATA_TABLE_ID);
+  public void testReadEmptySketchFromBigQuery() {
+    readSketchFromBigQuery(DATA_TABLE_ID_EMPTY, EXPECTED_COUNT_EMPTY);
+  }
+
+  private void readSketchFromBigQuery(String tableId, Long expectedCount) {
+    String tableSpec = String.format("%s.%s", DATASET_ID, tableId);
     String query =
         String.format(
             "SELECT HLL_COUNT.INIT(%s) AS %s FROM %s",
             DATA_FIELD_NAME, QUERY_RESULT_FIELD_NAME, tableSpec);
+
     SerializableFunction<SchemaAndRecord, byte[]> parseQueryResultToByteArray =
-        (SchemaAndRecord schemaAndRecord) ->
-            // BigQuery BYTES type corresponds to Java java.nio.ByteBuffer type
-            ((ByteBuffer) schemaAndRecord.getRecord().get(QUERY_RESULT_FIELD_NAME)).array();
+        input -> {
+          // BigQuery BYTES type corresponds to Java java.nio.ByteBuffer type
+          ByteBuffer sketch = (ByteBuffer) input.getRecord().get(QUERY_RESULT_FIELD_NAME);
+          if (sketch == null) {
+            // Empty sketch is represented by null in BigQuery and by empty byte array in Beam
+            return new byte[0];
+          } else {
+            byte[] result = new byte[sketch.remaining()];
 
 Review comment:
   Exactly. We know that is the case from looking at `Avro`'s implementation, but the compiler does not know that, so it gives a warning if we use `.array()`.
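   As background on why `.array()` is not a safe general-purpose way to read a `ByteBuffer`, the following standalone sketch uses only `java.nio` behaviour: a buffer can be a view over part of a larger array, or can expose no array at all, whereas copying `remaining()` bytes works in both cases. The class name `ArrayVsGet` and the helper `copyRemaining` are hypothetical names for this illustration, and the example says nothing about how Avro constructs its buffers.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

final class ArrayVsGet {
  /** Copies exactly the readable bytes, however the buffer is backed. */
  static byte[] copyRemaining(ByteBuffer buffer) {
    byte[] out = new byte[buffer.remaining()];
    // Read through a duplicate so the original buffer's position is unchanged.
    buffer.duplicate().get(out);
    return out;
  }

  public static void main(String[] args) {
    byte[] backing = {10, 20, 30, 40};

    // A buffer whose readable window is only the middle two bytes.
    ByteBuffer view = ByteBuffer.wrap(backing, 1, 2);
    System.out.println(view.array().length); // 4: the whole backing array
    System.out.println(Arrays.toString(copyRemaining(view))); // [20, 30]

    // A read-only buffer does not expose a backing array.
    ByteBuffer readOnly = view.asReadOnlyBuffer();
    System.out.println(readOnly.hasArray()); // false
    System.out.println(Arrays.toString(copyRemaining(readOnly))); // [20, 30]
  }
}
```

   Calling `array()` on the read-only buffer would throw `ReadOnlyBufferException`, so the copy-based approach is the one that does not depend on knowledge of the producer's implementation.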
 



Issue Time Tracking
---

Worklog Id: (was: 337110)
Time Spent: 35h 50m  (was: 35h 40m)



[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-10-31 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=337108&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-337108
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 01/Nov/19 00:21
Start Date: 01/Nov/19 00:21
Worklog Time Spent: 10m 
  Work Description: robinyqiu commented on pull request #9778: [BEAM-7013] 
Update BigQueryHllSketchCompatibilityIT to cover empty sketch cases
URL: https://github.com/apache/beam/pull/9778#discussion_r341414290
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/BigQueryHllSketchCompatibilityIT.java
 ##
 @@ -96,28 +106,37 @@
   public static void prepareDatasetAndDataTable() throws Exception {
 
 Review comment:
   Ah, nice catch!
 



Issue Time Tracking
---

Worklog Id: (was: 337108)
Time Spent: 35h 40m  (was: 35.5h)



[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-10-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=330565&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330565
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 18/Oct/19 15:20
Start Date: 18/Oct/19 15:20
Worklog Time Spent: 10m 
  Work Description: zfraa commented on pull request #9778: [BEAM-7013] 
Update BigQueryHllSketchCompatibilityIT to cover empty sketch cases
URL: https://github.com/apache/beam/pull/9778#discussion_r336525227
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/BigQueryHllSketchCompatibilityIT.java
 ##
 @@ -126,22 +145,49 @@ public static void deleteDataset() throws Exception {
   }
 
   /**
-   * Test that HLL++ sketch computed in BigQuery can be processed by Beam. Hll sketch is computed by
-   * {@code HLL_COUNT.INIT} in BigQuery and read into Beam; the test verifies that we can run {@link
-   * HllCount.MergePartial} and {@link HllCount.Extract} on the sketch in Beam to get the correct
-   * estimated count.
+   * Test that non-empty HLL++ sketch computed in BigQuery can be processed by Beam.
+   *
+   * Hll sketch is computed by {@code HLL_COUNT.INIT} in BigQuery and read into Beam; the test
+   * verifies that we can run {@link HllCount.MergePartial} and {@link HllCount.Extract} on the
+   * sketch in Beam to get the correct estimated count.
+   */
+  @Test
+  public void testReadNonEmptySketchFromBigQuery() {
+    readSketchFromBigQuery(DATA_TABLE_ID_NON_EMPTY, EXPECTED_COUNT_NON_EMPTY);
+  }
+
+  /**
+   * Test that empty HLL++ sketch computed in BigQuery can be processed by Beam.
 
 Review comment:
   (same here: "Test that an empty...", "The HLL sketch...")
 



Issue Time Tracking
---

Worklog Id: (was: 330565)
Time Spent: 35h 20m  (was: 35h 10m)



[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-10-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=330563&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330563
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 18/Oct/19 15:20
Start Date: 18/Oct/19 15:20
Worklog Time Spent: 10m 
  Work Description: zfraa commented on pull request #9778: [BEAM-7013] 
Update BigQueryHllSketchCompatibilityIT to cover empty sketch cases
URL: https://github.com/apache/beam/pull/9778#discussion_r336535846
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/BigQueryHllSketchCompatibilityIT.java
 ##
 @@ -96,28 +106,37 @@
   public static void prepareDatasetAndDataTable() throws Exception {
     BIGQUERY_CLIENT.createNewDataset(PROJECT_ID, DATASET_ID);
 
-    // Create Data Table
     TableSchema dataTableSchema =
         new TableSchema()
             .setFields(
                 Collections.singletonList(
                     new TableFieldSchema().setName(DATA_FIELD_NAME).setType(DATA_FIELD_TYPE)));
-    Table dataTable =
+
+    Table dataTableNonEmpty =
         new Table()
             .setSchema(dataTableSchema)
             .setTableReference(
                 new TableReference()
                     .setProjectId(PROJECT_ID)
                     .setDatasetId(DATASET_ID)
-                    .setTableId(DATA_TABLE_ID));
-    BIGQUERY_CLIENT.createNewTable(PROJECT_ID, DATASET_ID, dataTable);
-
+                    .setTableId(DATA_TABLE_ID_NON_EMPTY));
+    BIGQUERY_CLIENT.createNewTable(PROJECT_ID, DATASET_ID, dataTableNonEmpty);
     // Prepopulate test data to Data Table
 
 Review comment:
   "Prepopulate data tables with test data"
 



Issue Time Tracking
---

Worklog Id: (was: 330563)
Time Spent: 35h 10m  (was: 35h)



[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-10-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=330564&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330564
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 18/Oct/19 15:20
Start Date: 18/Oct/19 15:20
Worklog Time Spent: 10m 
  Work Description: zfraa commented on pull request #9778: [BEAM-7013] 
Update BigQueryHllSketchCompatibilityIT to cover empty sketch cases
URL: https://github.com/apache/beam/pull/9778#discussion_r336534833
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/BigQueryHllSketchCompatibilityIT.java
 ##
 @@ -96,28 +106,37 @@
   public static void prepareDatasetAndDataTable() throws Exception {
 
 Review comment:
   Maybe rename to "...AndDataTables()"?
 



Issue Time Tracking
---

Worklog Id: (was: 330564)
Time Spent: 35h 20m  (was: 35h 10m)



[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-10-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=330566&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330566
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 18/Oct/19 15:20
Start Date: 18/Oct/19 15:20
Worklog Time Spent: 10m 
  Work Description: zfraa commented on pull request #9778: [BEAM-7013] 
Update BigQueryHllSketchCompatibilityIT to cover empty sketch cases
URL: https://github.com/apache/beam/pull/9778#discussion_r336524398
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/BigQueryHllSketchCompatibilityIT.java
 ##
 @@ -126,22 +145,49 @@ public static void deleteDataset() throws Exception {
   }
 
   /**
-   * Test that HLL++ sketch computed in BigQuery can be processed by Beam. Hll sketch is computed by
-   * {@code HLL_COUNT.INIT} in BigQuery and read into Beam; the test verifies that we can run {@link
-   * HllCount.MergePartial} and {@link HllCount.Extract} on the sketch in Beam to get the correct
-   * estimated count.
+   * Test that non-empty HLL++ sketch computed in BigQuery can be processed by Beam.
+   *
+   * Hll sketch is computed by {@code HLL_COUNT.INIT} in BigQuery and read into Beam; the test
 
 Review comment:
   Nit: "Test that a HLL++ sketch...", and "The HLL sketch is computed...". Otherwise, LGTM!
   Also, all Javadoc should be in third person ("Tests that..." instead of "Test that"; see https://www.oracle.com/technetwork/articles/java/index-137868.html, "Use 3rd person..."). Sorry that I missed this in the first version of this code!
 



Issue Time Tracking
---

Worklog Id: (was: 330566)
Time Spent: 35.5h  (was: 35h 20m)



[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-10-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=330561=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330561
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 18/Oct/19 15:20
Start Date: 18/Oct/19 15:20
Worklog Time Spent: 10m 
  Work Description: zfraa commented on pull request #9778: [BEAM-7013] 
Update BigQueryHllSketchCompatibilityIT to cover empty sketch cases
URL: https://github.com/apache/beam/pull/9778#discussion_r336542489
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/BigQueryHllSketchCompatibilityIT.java
 ##
 @@ -65,23 +66,32 @@
   private static final List TEST_DATA =
   Arrays.asList("Apple", "Orange", "Banana", "Orange");
 
-  // Data Table: used by testReadSketchFromBigQuery())
+  // Data Table: used by tests reading sketches from BigQuery
   // Schema: only one STRING field named "data".
-  // Content: prepopulated with 4 rows: "Apple", "Orange", "Banana", "Orange"
-  private static final String DATA_TABLE_ID = "hll_data";
   private static final String DATA_FIELD_NAME = "data";
   private static final String DATA_FIELD_TYPE = "STRING";
   private static final String QUERY_RESULT_FIELD_NAME = "sketch";
-  private static final Long EXPECTED_COUNT = 3L;
 
-  // Sketch Table: used by testWriteSketchToBigQuery()
+  // Content: prepopulated with 4 rows: "Apple", "Orange", "Banana", "Orange"
+  private static final String DATA_TABLE_ID_NON_EMPTY = "hll_data_non_empty";
+  private static final Long EXPECTED_COUNT_NON_EMPTY = 3L;
+
+  // Content: empty
+  private static final String DATA_TABLE_ID_EMPTY = "hll_data_empty";
 
 Review comment:
   Does the aggregation (HLL_COUNT.INIT) over an empty table return a NULL 
sketch, as expected? I.e., did the test fail before you modified 
'parseQueryResultToByteArray' to deal with NULLs? I think it should (according 
to 
https://plx.corp.google.com/scripts2/script_5d._6fe78c__2144_8a38_883d24fc4a60,
 last SELECT), but double-checking. 
   I wish we could add an assert, but that doesn't work with the utility
method.
   
 



Issue Time Tracking
---

Worklog Id: (was: 330561)
Time Spent: 34h 50m  (was: 34h 40m)



[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-10-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=330562=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330562
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 18/Oct/19 15:20
Start Date: 18/Oct/19 15:20
Worklog Time Spent: 10m 
  Work Description: zfraa commented on pull request #9778: [BEAM-7013] 
Update BigQueryHllSketchCompatibilityIT to cover empty sketch cases
URL: https://github.com/apache/beam/pull/9778#discussion_r336529565
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/BigQueryHllSketchCompatibilityIT.java
 ##
 @@ -126,22 +145,49 @@ public static void deleteDataset() throws Exception {
   }
 
   /**
-   * Test that HLL++ sketch computed in BigQuery can be processed by Beam. Hll 
sketch is computed by
-   * {@code HLL_COUNT.INIT} in BigQuery and read into Beam; the test verifies 
that we can run {@link
-   * HllCount.MergePartial} and {@link HllCount.Extract} on the sketch in Beam 
to get the correct
-   * estimated count.
+   * Test that non-empty HLL++ sketch computed in BigQuery can be processed by 
Beam.
+   *
+   * Hll sketch is computed by {@code HLL_COUNT.INIT} in BigQuery and read 
into Beam; the test
+   * verifies that we can run {@link HllCount.MergePartial} and {@link 
HllCount.Extract} on the
+   * sketch in Beam to get the correct estimated count.
+   */
+  @Test
+  public void testReadNonEmptySketchFromBigQuery() {
+readSketchFromBigQuery(DATA_TABLE_ID_NON_EMPTY, EXPECTED_COUNT_NON_EMPTY);
+  }
+
+  /**
+   * Test that empty HLL++ sketch computed in BigQuery can be processed by 
Beam.
+   *
+   * Hll sketch is computed by {@code HLL_COUNT.INIT} in BigQuery and read 
into Beam; the test
+   * verifies that we can run {@link HllCount.MergePartial} and {@link 
HllCount.Extract} on the
+   * sketch in Beam to get the correct estimated count.
*/
   @Test
-  public void testReadSketchFromBigQuery() {
-String tableSpec = String.format("%s.%s", DATASET_ID, DATA_TABLE_ID);
+  public void testReadEmptySketchFromBigQuery() {
+readSketchFromBigQuery(DATA_TABLE_ID_EMPTY, EXPECTED_COUNT_EMPTY);
+  }
+
+  private void readSketchFromBigQuery(String tableId, Long expectedCount) {
+String tableSpec = String.format("%s.%s", DATASET_ID, tableId);
 String query =
 String.format(
 "SELECT HLL_COUNT.INIT(%s) AS %s FROM %s",
 DATA_FIELD_NAME, QUERY_RESULT_FIELD_NAME, tableSpec);
+
 SerializableFunction parseQueryResultToByteArray =
-(SchemaAndRecord schemaAndRecord) ->
-// BigQuery BYTES type corresponds to Java java.nio.ByteBuffer type
-((ByteBuffer) 
schemaAndRecord.getRecord().get(QUERY_RESULT_FIELD_NAME)).array();
+input -> {
+  // BigQuery BYTES type corresponds to Java java.nio.ByteBuffer type
+  ByteBuffer sketch = (ByteBuffer) 
input.getRecord().get(QUERY_RESULT_FIELD_NAME);
+  if (sketch == null) {
+// Empty sketch is represented by null in BigQuery and by empty 
byte array in Beam
+return new byte[0];
+  } else {
+byte[] result = new byte[sketch.remaining()];
 
 Review comment:
   Why not
   `return sketch.array()`
   as previously? Is it because we can't be 100% sure that the ByteBuffer is
backed by an accessible array?
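
   For context on the question above, here is a minimal sketch of the copy-based approach. It assumes a hypothetical helper class `SketchBytes` that is not part of Beam or ZetaSketch. `ByteBuffer.array()` throws for read-only buffers and for buffers without an accessible backing array, and it returns the whole backing array regardless of position and limit. Copying `remaining()` bytes sidesteps both problems.

```java
import java.nio.ByteBuffer;

/** Hypothetical helper for illustration only; not part of Beam or ZetaSketch. */
final class SketchBytes {
  private SketchBytes() {}

  /**
   * Copies the readable bytes of {@code sketch} into a new array.
   * A null buffer (a BigQuery NULL) maps to a zero-length array.
   */
  static byte[] toByteArray(ByteBuffer sketch) {
    if (sketch == null) {
      return new byte[0];
    }
    byte[] result = new byte[sketch.remaining()];
    // duplicate() shares the content but has its own position and limit,
    // so reading from it does not advance the caller's buffer.
    sketch.duplicate().get(result);
    return result;
  }
}
```

   This works the same way for heap, direct, and read-only buffers, at the cost of one copy per record.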
 



Issue Time Tracking
---

Worklog Id: (was: 330562)
Time Spent: 35h  (was: 34h 50m)



[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-10-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=327185=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-327185
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 12/Oct/19 02:27
Start Date: 12/Oct/19 02:27
Worklog Time Spent: 10m 
  Work Description: robinyqiu commented on issue #9778: [BEAM-7013] Update 
BigQueryHllSketchCompatibilityIT to cover empty sketch cases
URL: https://github.com/apache/beam/pull/9778#issuecomment-541275070
 
 
   See all the 4 test cases pass here: 
https://builds.apache.org/job/beam_PostCommit_Java_PR/237/testReport/org.apache.beam.sdk.extensions.zetasketch/BigQueryHllSketchCompatibilityIT/
 



Issue Time Tracking
---

Worklog Id: (was: 327185)
Time Spent: 34h 40m  (was: 34.5h)



[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-10-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=327140=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-327140
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 11/Oct/19 23:43
Start Date: 11/Oct/19 23:43
Worklog Time Spent: 10m 
  Work Description: robinyqiu commented on issue #9778: [BEAM-7013] Update 
BigQueryHllSketchCompatibilityIT to cover empty sketch cases
URL: https://github.com/apache/beam/pull/9778#issuecomment-541257592
 
 
   Run Java PostCommit
 



Issue Time Tracking
---

Worklog Id: (was: 327140)
Time Spent: 34.5h  (was: 34h 20m)



[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-10-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=327139=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-327139
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 11/Oct/19 23:41
Start Date: 11/Oct/19 23:41
Worklog Time Spent: 10m 
  Work Description: robinyqiu commented on pull request #9778: [BEAM-7013] 
Update BigQueryHllSketchCompatibilityIT to cover empty sketch cases
URL: https://github.com/apache/beam/pull/9778
 
 
   r: @zfraa
   
   
   

[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-09-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=310910=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-310910
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 11/Sep/19 18:23
Start Date: 11/Sep/19 18:23
Worklog Time Spent: 10m 
  Work Description: boyuanzz commented on pull request #9519: [BEAM-7013] 
Use a 0-length byte array to represent empty sketch in HllCount
URL: https://github.com/apache/beam/pull/9519
 
 
   
 



Issue Time Tracking
---

Worklog Id: (was: 310910)
Time Spent: 34h 10m  (was: 34h)



[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-09-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=310111=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-310111
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 10/Sep/19 20:53
Start Date: 10/Sep/19 20:53
Worklog Time Spent: 10m 
  Work Description: zfraa commented on pull request #9519: [BEAM-7013] Use 
a 0-length byte array to represent empty sketch in HllCount
URL: https://github.com/apache/beam/pull/9519#discussion_r322957148
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/HllCountTest.java
 ##
 @@ -302,12 +304,53 @@ public void testMergePartialGlobally_SingletonInput() {
 
   @Test
   @Category(NeedsRunner.class)
-  public void testMergePartialGlobally_EmptyInput() {
+  public void testMergePartialGlobally_SingletonInputEmptySketch() {
 PCollection result =
-p.apply(Create.empty(TypeDescriptor.of(byte[].class)))
+
p.apply(Create.of(EMPTY_SKETCH)).apply(HllCount.MergePartial.globally());
+
+PAssert.thatSingleton(result).isEqualTo(EMPTY_SKETCH);
+p.run();
+  }
+
+  @Test
+  @Category(NeedsRunner.class)
+  public void testMergePartialGlobally_MergeWithEmptySketch() {
+PCollection result =
+p.apply(Create.of(LONGS_SKETCH, 
EMPTY_SKETCH)).apply(HllCount.MergePartial.globally());
+
+PAssert.thatSingleton(result).isEqualTo(LONGS_SKETCH);
+p.run();
+  }
+
+  @Test
+  @Category(NeedsRunner.class)
+  public void testMergePartialGlobally_MergeMultipleEmptySketches() {
+PCollection result =
+p.apply(Create.of(EMPTY_SKETCH, 
EMPTY_SKETCH)).apply(HllCount.MergePartial.globally());
+
+PAssert.thatSingleton(result).isEqualTo(EMPTY_SKETCH);
+p.run();
+  }
+
+  @Test
+  @Category(NeedsRunner.class)
+  public void testMergePartialGlobally_MergeWithSketchOfEmptySet() {
+PCollection result =
+p.apply(Create.of(LONGS_SKETCH, LONGS_SKETCH_OF_EMPTY_SET))
+.apply(HllCount.MergePartial.globally());
+
+PAssert.thatSingleton(result).isEqualTo(LONGS_SKETCH);
+p.run();
+  }
+
+  @Test
+  @Category(NeedsRunner.class)
+  public void testMergePartialGlobally_MergeEmptySketchWithSketchOfEmptySet() {
+PCollection result =
+p.apply(Create.of(EMPTY_SKETCH, LONGS_SKETCH_OF_EMPTY_SET))
 .apply(HllCount.MergePartial.globally());
 
-PAssert.that(result).empty();
+PAssert.thatSingleton(result).isEqualTo(LONGS_SKETCH_OF_EMPTY_SET);
 
 Review comment:
   Good point, SG. 
 



Issue Time Tracking
---

Worklog Id: (was: 310111)
Time Spent: 34h  (was: 33h 50m)



[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-09-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=310088=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-310088
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 10/Sep/19 20:34
Start Date: 10/Sep/19 20:34
Worklog Time Spent: 10m 
  Work Description: robinyqiu commented on pull request #9519: [BEAM-7013] 
Use a 0-length byte array to represent empty sketch in HllCount
URL: https://github.com/apache/beam/pull/9519#discussion_r322949267
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/HllCountTest.java
 ##
 @@ -302,12 +304,53 @@ public void testMergePartialGlobally_SingletonInput() {
 
   @Test
   @Category(NeedsRunner.class)
-  public void testMergePartialGlobally_EmptyInput() {
+  public void testMergePartialGlobally_SingletonInputEmptySketch() {
 PCollection result =
-p.apply(Create.empty(TypeDescriptor.of(byte[].class)))
+
p.apply(Create.of(EMPTY_SKETCH)).apply(HllCount.MergePartial.globally());
+
+PAssert.thatSingleton(result).isEqualTo(EMPTY_SKETCH);
+p.run();
+  }
+
+  @Test
+  @Category(NeedsRunner.class)
+  public void testMergePartialGlobally_MergeWithEmptySketch() {
+PCollection result =
+p.apply(Create.of(LONGS_SKETCH, 
EMPTY_SKETCH)).apply(HllCount.MergePartial.globally());
+
+PAssert.thatSingleton(result).isEqualTo(LONGS_SKETCH);
+p.run();
+  }
+
+  @Test
+  @Category(NeedsRunner.class)
+  public void testMergePartialGlobally_MergeMultipleEmptySketches() {
+PCollection result =
+p.apply(Create.of(EMPTY_SKETCH, 
EMPTY_SKETCH)).apply(HllCount.MergePartial.globally());
+
+PAssert.thatSingleton(result).isEqualTo(EMPTY_SKETCH);
+p.run();
+  }
+
+  @Test
+  @Category(NeedsRunner.class)
+  public void testMergePartialGlobally_MergeWithSketchOfEmptySet() {
+PCollection result =
+p.apply(Create.of(LONGS_SKETCH, LONGS_SKETCH_OF_EMPTY_SET))
+.apply(HllCount.MergePartial.globally());
+
+PAssert.thatSingleton(result).isEqualTo(LONGS_SKETCH);
+p.run();
+  }
+
+  @Test
+  @Category(NeedsRunner.class)
+  public void testMergePartialGlobally_MergeEmptySketchWithSketchOfEmptySet() {
+PCollection result =
+p.apply(Create.of(EMPTY_SKETCH, LONGS_SKETCH_OF_EMPTY_SET))
 .apply(HllCount.MergePartial.globally());
 
-PAssert.that(result).empty();
+PAssert.thatSingleton(result).isEqualTo(LONGS_SKETCH_OF_EMPTY_SET);
 
 Review comment:
   > this is testing an implementation detail (there are other valid return 
values here).
   
   Yes, but the result is exposed to users anyway, so I prefer keeping it as
is. If we decide to return a different value (e.g. byte[0]) in the future, we
will change this unit test as well.
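
   To make the expectations in the tests above concrete, here is a toy model of the merge semantics. It is NOT HyperLogLog++ and not the actual `HllCountMergePartialFn`. The class `ToyMergePartial`, its tag byte, and its serialization format are all assumptions made up for illustration. The only point is that a zero-length array acts as an identity element, while a "sketch of the empty set" is a real, non-empty serialized value.

```java
import java.util.List;
import java.util.TreeSet;

/** Toy stand-in for a sketch merger; not HLL++ and not Beam code. */
final class ToyMergePartial {
  static final byte[] EMPTY_SKETCH = new byte[0];
  /** Hypothetical type tag; a sketch of the empty set is this tag with no values. */
  static final byte LONGS_TAG = 1;

  private ToyMergePartial() {}

  /** Builds a toy sketch: the type tag followed by the distinct values in sorted order. */
  static byte[] sketchOf(byte... values) {
    TreeSet<Byte> distinct = new TreeSet<>();
    for (byte v : values) {
      distinct.add(v);
    }
    return serialize(distinct);
  }

  /** Merges toy sketches; zero-length inputs are skipped and never create an accumulator. */
  static byte[] merge(List<byte[]> sketches) {
    TreeSet<Byte> accumulator = null; // null means only empty sketches were seen so far
    for (byte[] sketch : sketches) {
      if (sketch.length == 0) {
        continue; // the empty sketch is the identity element
      }
      if (accumulator == null) {
        accumulator = new TreeSet<>();
      }
      for (int i = 1; i < sketch.length; i++) { // index 0 holds the type tag
        accumulator.add(sketch[i]);
      }
    }
    return accumulator == null ? EMPTY_SKETCH : serialize(accumulator);
  }

  private static byte[] serialize(TreeSet<Byte> values) {
    byte[] out = new byte[values.size() + 1];
    out[0] = LONGS_TAG;
    int i = 1;
    for (byte v : values) {
      out[i++] = v;
    }
    return out;
  }
}
```

   In this model, merging `EMPTY_SKETCH` with a sketch of the empty set yields the sketch of the empty set rather than `byte[0]`, which mirrors the last assertion in the diff.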
 



Issue Time Tracking
---

Worklog Id: (was: 310088)
Time Spent: 33h 50m  (was: 33h 40m)



[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-09-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=310085=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-310085
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 10/Sep/19 20:30
Start Date: 10/Sep/19 20:30
Worklog Time Spent: 10m 
  Work Description: zfraa commented on issue #9519: [BEAM-7013] Use a 
0-length byte array to represent empty sketch in HllCount
URL: https://github.com/apache/beam/pull/9519#issuecomment-530106641
 
 
   Still LGTM! On the test naming with underscores: Yep, zetasketch does this 
-- we follow the principle outlined e.g. here: 
https://osherove.com/blog/2005/4/3/naming-standards-for-unit-tests.html, which 
is also consistent with internal Java Style. But let's follow Beam style here! 
 



Issue Time Tracking
---

Worklog Id: (was: 310085)
Time Spent: 33h 40m  (was: 33.5h)



[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-09-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=310084=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-310084
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 10/Sep/19 20:29
Start Date: 10/Sep/19 20:29
Worklog Time Spent: 10m 
  Work Description: robinyqiu commented on pull request #9519: [BEAM-7013] 
Use a 0-length byte array to represent empty sketch in HllCount
URL: https://github.com/apache/beam/pull/9519#discussion_r322947030
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCountMergePartialFn.java
 ##
 @@ -54,10 +54,15 @@ private HllCountMergePartialFn() {}
 return null;
   }
 
+  @Nullable
   @Override
   public HyperLogLogPlusPlus addInput(
   @Nullable HyperLogLogPlusPlus accumulator, byte[] input) {
 
 Review comment:
   @boyuanzz PTAL at the added logging
 



Issue Time Tracking
---

Worklog Id: (was: 310084)
Time Spent: 33.5h  (was: 33h 20m)



[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-09-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=310083=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-310083
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 10/Sep/19 20:27
Start Date: 10/Sep/19 20:27
Worklog Time Spent: 10m 
  Work Description: zfraa commented on pull request #9519: [BEAM-7013] Use 
a 0-length byte array to represent empty sketch in HllCount
URL: https://github.com/apache/beam/pull/9519#discussion_r322946215
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCountMergePartialFn.java
 ##
 @@ -54,10 +54,15 @@ private HllCountMergePartialFn() {}
 return null;
   }
 
+  @Nullable
   @Override
   public HyperLogLogPlusPlus addInput(
   @Nullable HyperLogLogPlusPlus accumulator, byte[] input) {
 
 Review comment:
   LGTM! Maybe Boyuan can also have a quick look at the logging? Not familiar 
with what's usual in Beam. 
 



Issue Time Tracking
---

Worklog Id: (was: 310083)
Time Spent: 33h 20m  (was: 33h 10m)



[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-09-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=310032=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-310032
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 10/Sep/19 18:50
Start Date: 10/Sep/19 18:50
Worklog Time Spent: 10m 
  Work Description: robinyqiu commented on pull request #9519: [BEAM-7013] 
Use a 0-length byte array to represent empty sketch in HllCount
URL: https://github.com/apache/beam/pull/9519#discussion_r322906525
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCountMergePartialFn.java
 ##
 @@ -54,10 +54,15 @@ private HllCountMergePartialFn() {}
 return null;
   }
 
+  @Nullable
   @Override
   public HyperLogLogPlusPlus addInput(
   @Nullable HyperLogLogPlusPlus accumulator, byte[] input) {
 
 Review comment:
   > I would lean towards avoiding user errors, since every error avoided is 
something that users don't need to revise their pipeline over, and is an issue 
that is not escalated to us.
   
   Agreed. Actually, I figured out that we can accept nulls and log a
warning that suggests replacing them with byte[0]. I made that change; PTAL.
   
   > Also, if users need to filter their input and replace nulls with byte[0], 
is that streamed (resp. folded into another pass over the data) or does it 
result in an extra-pass over the data?
   
   That depends on their pipeline implementation. If users do that in the 
`PTransform` where null is created (e.g. 
[here](https://github.com/robinyqiu/beam/blob/3b6a628c9ad0fbf63b7c1f7d355dbc8cf5219eb2/sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/BigQueryHllSketchCompatibilityIT.java#L144)
 in BigQueryIO), then it will not result in an extra-pass.
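The point about avoiding an extra pass can be sketched in plain Java. This is only an illustration under stated assumptions: `sketchOrEmpty` and the class name are hypothetical, and a `java.util.stream` map stands in for the Beam step that already parses each row, so the null check rides along in that same step.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class NullToEmptySketch {
  // Hypothetical helper: maps a missing (null) sketch to the 0-length "empty sketch".
  static byte[] sketchOrEmpty(byte[] raw) {
    return raw == null ? new byte[0] : raw;
  }

  public static void main(String[] args) {
    // Stand-in for rows read from a source; the second row carries no sketch.
    List<byte[]> rows = Arrays.asList(new byte[] {1, 2}, null, new byte[] {3});

    // The replacement happens inside the map that already touches every row,
    // so no additional traversal of the data is needed.
    List<byte[]> normalized =
        rows.stream().map(NullToEmptySketch::sketchOrEmpty).collect(Collectors.toList());

    for (byte[] sketch : normalized) {
      System.out.println(sketch.length); // prints 2, then 0, then 1
    }
  }
}
```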
 



Issue Time Tracking
---

Worklog Id: (was: 310032)
Time Spent: 33h 10m  (was: 33h)

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
>  Issue Type: New Feature
>  Components: extensions-java-sketching, sdk-java-core
>Reporter: Yueyang Qiu
>Assignee: Yueyang Qiu
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 33h 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-09-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=310025&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-310025
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 10/Sep/19 18:41
Start Date: 10/Sep/19 18:41
Worklog Time Spent: 10m 
  Work Description: robinyqiu commented on pull request #9519: [BEAM-7013] 
Use a 0-length byte array to represent empty sketch in HllCount
URL: https://github.com/apache/beam/pull/9519#discussion_r322902745
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCountMergePartialFn.java
 ##
 @@ -54,10 +54,16 @@ private HllCountMergePartialFn() {}
 return null;
   }
 
+  @Nullable
   @Override
   public HyperLogLogPlusPlus addInput(
   @Nullable HyperLogLogPlusPlus accumulator, byte[] input) {
-if (accumulator == null) {
+if (input == null) {
+  throw new NullPointerException(
 
 Review comment:
   Nice catch! (But this is outdated since I changed the behavior.)
 



Issue Time Tracking
---

Worklog Id: (was: 310025)
Time Spent: 33h  (was: 32h 50m)

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
>  Issue Type: New Feature
>  Components: extensions-java-sketching, sdk-java-core
>Reporter: Yueyang Qiu
>Assignee: Yueyang Qiu
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 33h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-09-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=310024&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-310024
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 10/Sep/19 18:38
Start Date: 10/Sep/19 18:38
Worklog Time Spent: 10m 
  Work Description: robinyqiu commented on pull request #9519: [BEAM-7013] 
Use a 0-length byte array to represent empty sketch in HllCount
URL: https://github.com/apache/beam/pull/9519#discussion_r322901323
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/HllCountTest.java
 ##
 @@ -370,4 +464,14 @@ public void testExtractPerKey() {
 .containsInAnyOrder(Arrays.asList(KV.of("k", INTS1_ESTIMATE), KV.of("k", INTS2_ESTIMATE)));
 p.run();
   }
+
+  @Test
+  @Category(NeedsRunner.class)
+  public void testExtractPerKey_EmptySketch() {
 
 Review comment:
   Nice catch of the missing test! Done.
 



Issue Time Tracking
---

Worklog Id: (was: 310024)
Time Spent: 32h 50m  (was: 32h 40m)

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
>  Issue Type: New Feature
>  Components: extensions-java-sketching, sdk-java-core
>Reporter: Yueyang Qiu
>Assignee: Yueyang Qiu
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 32h 50m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-09-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=310021&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-310021
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 10/Sep/19 18:35
Start Date: 10/Sep/19 18:35
Worklog Time Spent: 10m 
  Work Description: robinyqiu commented on pull request #9519: [BEAM-7013] 
Use a 0-length byte array to represent empty sketch in HllCount
URL: https://github.com/apache/beam/pull/9519#discussion_r322899925
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCount.java
 ##
 @@ -346,7 +354,14 @@ private Extract() {}
   @ProcessElement
 
 Review comment:
   Thanks for the suggestion! Done.
 



Issue Time Tracking
---

Worklog Id: (was: 310021)
Time Spent: 32h 40m  (was: 32.5h)

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
>  Issue Type: New Feature
>  Components: extensions-java-sketching, sdk-java-core
>Reporter: Yueyang Qiu
>Assignee: Yueyang Qiu
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 32h 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-09-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=310015&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-310015
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 10/Sep/19 18:25
Start Date: 10/Sep/19 18:25
Worklog Time Spent: 10m 
  Work Description: robinyqiu commented on pull request #9519: [BEAM-7013] 
Use a 0-length byte array to represent empty sketch in HllCount
URL: https://github.com/apache/beam/pull/9519#discussion_r322895610
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/HllCountTest.java
 ##
 @@ -281,12 +283,12 @@ public void testMergePartialGlobally() {
 
   @Test
   @Category(NeedsRunner.class)
-  public void testMergePartialGlobally_MergeWithSketchForEmptySet() {
+  public void testMergePartialGlobally_EmptyInput() {
 
 Review comment:
   Nice catch! I was following the style in the [zetasketch 
library](https://github.com/google/zetasketch/blob/4aa44e9cb543d766318b919fc6bb359bdf7b0809/javatests/com/google/zetasketch/HyperLogLogPlusPlusTest.java#L97),
 which I think is Google style, but I am not sure whether it complies with Beam 
style, so I have changed it.
 



Issue Time Tracking
---

Worklog Id: (was: 310015)
Time Spent: 32.5h  (was: 32h 20m)

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
>  Issue Type: New Feature
>  Components: extensions-java-sketching, sdk-java-core
>Reporter: Yueyang Qiu
>Assignee: Yueyang Qiu
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 32.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-09-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=310002&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-310002
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 10/Sep/19 18:05
Start Date: 10/Sep/19 18:05
Worklog Time Spent: 10m 
  Work Description: robinyqiu commented on pull request #9519: [BEAM-7013] 
Use a 0-length byte array to represent empty sketch in HllCount
URL: https://github.com/apache/beam/pull/9519#discussion_r322886862
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCountInitFn.java
 ##
 @@ -57,6 +57,11 @@ void setPrecision(int precision) {
 this.precision = precision;
   }
 
+  @Override
+  public byte[] defaultValue() {
+return new byte[0];
 
 Review comment:
   Ah I see. You are comparing byte[0] representation with the proto 
representation. Done!
 



Issue Time Tracking
---

Worklog Id: (was: 310002)
Time Spent: 32h 20m  (was: 32h 10m)

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
>  Issue Type: New Feature
>  Components: extensions-java-sketching, sdk-java-core
>Reporter: Yueyang Qiu
>Assignee: Yueyang Qiu
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 32h 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-09-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=309991&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-309991
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 10/Sep/19 17:43
Start Date: 10/Sep/19 17:43
Worklog Time Spent: 10m 
  Work Description: robinyqiu commented on pull request #9519: [BEAM-7013] 
Use a 0-length byte array to represent empty sketch in HllCount
URL: https://github.com/apache/beam/pull/9519#discussion_r322876783
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCount.java
 ##
 @@ -237,10 +240,11 @@ private Builder(HllCountInitFn initFn) {
* PCollection} and returns a {@code PCollection} which consists of the HLL++
* sketch computed from the elements in the input {@code PCollection}.
*
-   * Returns an empty output {@code PCollection} if the input {@code PCollection} is empty.
+   * Returns a singleton {@code PCollection} with an "empty sketch" (0-length byte array) if
+   * the input {@code PCollection} is empty.
 
 Review comment:
   > Or does a perKey aggregation work exactly like a group-by, i.e., it can 
never be associated with an empty aggregation?
   
   Yes. (Under the hood it uses a transform called `GroupByKey`.)
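To make the global versus per-key difference concrete, here is a minimal plain-Java sketch. It is an analogy, not Beam code: the class and method names are hypothetical, and a `HashMap` grouping stands in for `GroupByKey`. A global aggregation always produces one result, so it needs a value for empty input, whereas a key only exists in the grouped output if at least one element carried it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class GlobalVsPerKeyDemo {
  // Global aggregation: always yields exactly one result, even for empty input.
  static int globalDistinct(List<String> values) {
    return new HashSet<>(values).size();
  }

  // Per-key aggregation: a key appears in the output only if some element carried it,
  // so there is never a key associated with an empty group.
  static Map<String, Integer> perKeyDistinct(List<String[]> pairs) {
    Map<String, Set<String>> groups = new HashMap<>();
    for (String[] kv : pairs) {
      groups.computeIfAbsent(kv[0], k -> new HashSet<>()).add(kv[1]);
    }
    Map<String, Integer> result = new HashMap<>();
    for (Map.Entry<String, Set<String>> e : groups.entrySet()) {
      result.put(e.getKey(), e.getValue().size());
    }
    return result;
  }

  public static void main(String[] args) {
    System.out.println(globalDistinct(new ArrayList<>())); // 0
    System.out.println(perKeyDistinct(new ArrayList<>())); // {}
  }
}
```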
 



Issue Time Tracking
---

Worklog Id: (was: 309991)
Time Spent: 32h 10m  (was: 32h)

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
>  Issue Type: New Feature
>  Components: extensions-java-sketching, sdk-java-core
>Reporter: Yueyang Qiu
>Assignee: Yueyang Qiu
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 32h 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-09-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=309921&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-309921
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 10/Sep/19 15:57
Start Date: 10/Sep/19 15:57
Worklog Time Spent: 10m 
  Work Description: zfraa commented on pull request #9519: [BEAM-7013] Use 
a 0-length byte array to represent empty sketch in HllCount
URL: https://github.com/apache/beam/pull/9519#discussion_r322824226
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/HllCountTest.java
 ##
 @@ -370,4 +464,14 @@ public void testExtractPerKey() {
 .containsInAnyOrder(Arrays.asList(KV.of("k", INTS1_ESTIMATE), KV.of("k", INTS2_ESTIMATE)));
 p.run();
   }
+
+  @Test
+  @Category(NeedsRunner.class)
+  public void testExtractPerKey_EmptySketch() {
 
 Review comment:
   Since the Extract implementations for Global and PerKey are different, I would 
test both types of empty sketches for both cases. Otherwise, very nice test 
coverage! :) 
 



Issue Time Tracking
---

Worklog Id: (was: 309921)
Time Spent: 31h 50m  (was: 31h 40m)

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
>  Issue Type: New Feature
>  Components: extensions-java-sketching, sdk-java-core
>Reporter: Yueyang Qiu
>Assignee: Yueyang Qiu
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 31h 50m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-09-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=309922&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-309922
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 10/Sep/19 15:57
Start Date: 10/Sep/19 15:57
Worklog Time Spent: 10m 
  Work Description: zfraa commented on pull request #9519: [BEAM-7013] Use 
a 0-length byte array to represent empty sketch in HllCount
URL: https://github.com/apache/beam/pull/9519#discussion_r322657271
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCountMergePartialFn.java
 ##
 @@ -54,10 +54,15 @@ private HllCountMergePartialFn() {}
 return null;
   }
 
+  @Nullable
   @Override
   public HyperLogLogPlusPlus addInput(
   @Nullable HyperLogLogPlusPlus accumulator, byte[] input) {
 
 Review comment:
   Let's leave things as they are in this PR, so that this can be merged! 
   
   Whether we should do any follow-ups: I would lean towards avoiding user 
errors, since every error avoided is something that users don't need to revise 
their pipeline over, and is an issue that is not escalated to us. Also, if 
users need to filter their input and replace nulls with byte[0], is that 
streamed (resp. folded into another pass over the data) or does it result in an 
extra-pass over the data? 
   But I don't feel strongly about it and I'll let you make the call, and as 
you say, we can always change it later if there's a need for it. 
 



Issue Time Tracking
---

Worklog Id: (was: 309922)

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
>  Issue Type: New Feature
>  Components: extensions-java-sketching, sdk-java-core
>Reporter: Yueyang Qiu
>Assignee: Yueyang Qiu
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 31h 50m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-09-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=309919&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-309919
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 10/Sep/19 15:57
Start Date: 10/Sep/19 15:57
Worklog Time Spent: 10m 
  Work Description: zfraa commented on pull request #9519: [BEAM-7013] Use 
a 0-length byte array to represent empty sketch in HllCount
URL: https://github.com/apache/beam/pull/9519#discussion_r322654460
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCountInitFn.java
 ##
 @@ -57,6 +57,11 @@ void setPrecision(int precision) {
 this.precision = precision;
   }
 
+  @Override
+  public byte[] defaultValue() {
+return new byte[0];
 
 Review comment:
   Nits: s/Result/The result
   + I would add "because we cannot..., and because it's more compact." 
   Otherwise looks great! 
 



Issue Time Tracking
---

Worklog Id: (was: 309919)
Time Spent: 31h 50m  (was: 31h 40m)

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
>  Issue Type: New Feature
>  Components: extensions-java-sketching, sdk-java-core
>Reporter: Yueyang Qiu
>Assignee: Yueyang Qiu
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 31h 50m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-09-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=309920&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-309920
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 10/Sep/19 15:57
Start Date: 10/Sep/19 15:57
Worklog Time Spent: 10m 
  Work Description: zfraa commented on pull request #9519: [BEAM-7013] Use 
a 0-length byte array to represent empty sketch in HllCount
URL: https://github.com/apache/beam/pull/9519#discussion_r322760086
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/HllCountTest.java
 ##
 @@ -302,12 +304,53 @@ public void testMergePartialGlobally_SingletonInput() {
 
   @Test
   @Category(NeedsRunner.class)
-  public void testMergePartialGlobally_EmptyInput() {
+  public void testMergePartialGlobally_SingletonInputEmptySketch() {
 PCollection result =
-p.apply(Create.empty(TypeDescriptor.of(byte[].class)))
+p.apply(Create.of(EMPTY_SKETCH)).apply(HllCount.MergePartial.globally());
+
+PAssert.thatSingleton(result).isEqualTo(EMPTY_SKETCH);
+p.run();
+  }
+
+  @Test
+  @Category(NeedsRunner.class)
+  public void testMergePartialGlobally_MergeWithEmptySketch() {
+PCollection result =
+p.apply(Create.of(LONGS_SKETCH, EMPTY_SKETCH)).apply(HllCount.MergePartial.globally());
+
+PAssert.thatSingleton(result).isEqualTo(LONGS_SKETCH);
+p.run();
+  }
+
+  @Test
+  @Category(NeedsRunner.class)
+  public void testMergePartialGlobally_MergeMultipleEmptySketches() {
+PCollection result =
+p.apply(Create.of(EMPTY_SKETCH, EMPTY_SKETCH)).apply(HllCount.MergePartial.globally());
+
+PAssert.thatSingleton(result).isEqualTo(EMPTY_SKETCH);
+p.run();
+  }
+
+  @Test
+  @Category(NeedsRunner.class)
+  public void testMergePartialGlobally_MergeWithSketchOfEmptySet() {
+PCollection result =
+p.apply(Create.of(LONGS_SKETCH, LONGS_SKETCH_OF_EMPTY_SET))
+.apply(HllCount.MergePartial.globally());
+
+PAssert.thatSingleton(result).isEqualTo(LONGS_SKETCH);
+p.run();
+  }
+
+  @Test
+  @Category(NeedsRunner.class)
+  public void testMergePartialGlobally_MergeEmptySketchWithSketchOfEmptySet() {
+PCollection result =
+p.apply(Create.of(EMPTY_SKETCH, LONGS_SKETCH_OF_EMPTY_SET))
 .apply(HllCount.MergePartial.globally());
 
-PAssert.that(result).empty();
+PAssert.thatSingleton(result).isEqualTo(LONGS_SKETCH_OF_EMPTY_SET);
 
 Review comment:
   Nit: this is testing an implementation detail (there are other valid return 
values here). Maybe apply an extract and verify that the result is zero, 
because there, there's only one valid answer? 
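A toy illustration of why asserting on the extracted value is more robust than asserting on the merged bytes. Everything here is an assumption made for the example: the one-byte `TAG` header, the class name, and the `merge`/`extract` helpers are a made-up format, not the zetasketch serialization. The only point is that two different byte arrays can both represent "no elements", while the extracted count has a single valid answer.

```java
import java.util.Arrays;
import java.util.TreeSet;

public class ExtractInsteadOfBytesDemo {
  static final byte TAG = 1; // hypothetical header byte of the toy format
  static final byte[] EMPTY_SKETCH = new byte[0]; // 0-length "empty sketch"
  static final byte[] SKETCH_OF_EMPTY_SET = new byte[] {TAG}; // header only, no values

  // Merges two toy sketches; a 0-length input contributes nothing.
  static byte[] merge(byte[] a, byte[] b) {
    if (a.length == 0) return b;
    if (b.length == 0) return a;
    TreeSet<Byte> union = new TreeSet<>();
    for (int i = 1; i < a.length; i++) union.add(a[i]);
    for (int i = 1; i < b.length; i++) union.add(b[i]);
    byte[] out = new byte[union.size() + 1];
    out[0] = TAG;
    int pos = 1;
    for (byte v : union) out[pos++] = v;
    return out;
  }

  // Number of distinct values stored in a toy sketch; 0 for either empty representation.
  static long extract(byte[] sketch) {
    return sketch.length == 0 ? 0L : sketch.length - 1;
  }

  public static void main(String[] args) {
    byte[] merged = merge(EMPTY_SKETCH, SKETCH_OF_EMPTY_SET);
    // The two empty representations differ as bytes...
    System.out.println(Arrays.equals(EMPTY_SKETCH, SKETCH_OF_EMPTY_SET)); // false
    // ...but whichever one a merge returns, the extracted count is 0.
    System.out.println(extract(merged)); // 0
    System.out.println(extract(EMPTY_SKETCH)); // 0
  }
}
```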
 



Issue Time Tracking
---

Worklog Id: (was: 309920)
Time Spent: 31h 50m  (was: 31h 40m)

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
>  Issue Type: New Feature
>  Components: extensions-java-sketching, sdk-java-core
>Reporter: Yueyang Qiu
>Assignee: Yueyang Qiu
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 31h 50m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-09-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=309924&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-309924
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 10/Sep/19 15:57
Start Date: 10/Sep/19 15:57
Worklog Time Spent: 10m 
  Work Description: zfraa commented on pull request #9519: [BEAM-7013] Use 
a 0-length byte array to represent empty sketch in HllCount
URL: https://github.com/apache/beam/pull/9519#discussion_r322661239
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCount.java
 ##
 @@ -237,10 +240,11 @@ private Builder(HllCountInitFn initFn) {
* PCollection} and returns a {@code PCollection} which consists of the HLL++
* sketch computed from the elements in the input {@code PCollection}.
*
-   * Returns an empty output {@code PCollection} if the input {@code PCollection} is empty.
+   * Returns a singleton {@code PCollection} with an "empty sketch" (0-length byte array) if
+   * the input {@code PCollection} is empty.
 
 Review comment:
   Could you add the same comment to the perKey method below? Or does a perKey 
aggregation work exactly like a group-by, i.e., it can never be associated with 
an empty aggregation? 
 



Issue Time Tracking
---

Worklog Id: (was: 309924)
Time Spent: 32h  (was: 31h 50m)

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
>  Issue Type: New Feature
>  Components: extensions-java-sketching, sdk-java-core
>Reporter: Yueyang Qiu
>Assignee: Yueyang Qiu
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 32h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-09-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=309923&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-309923
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 10/Sep/19 15:57
Start Date: 10/Sep/19 15:57
Worklog Time Spent: 10m 
  Work Description: zfraa commented on pull request #9519: [BEAM-7013] Use 
a 0-length byte array to represent empty sketch in HllCount
URL: https://github.com/apache/beam/pull/9519#discussion_r322656124
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCountMergePartialFn.java
 ##
 @@ -54,10 +54,15 @@ private HllCountMergePartialFn() {}
 return null;
   }
 
+  @Nullable
 
 Review comment:
   Right, same confusion on my side as below (byte[] empty sketch vs. 
HyperLogLogPlusPlus empty sketch). LG! 
 



Issue Time Tracking
---

Worklog Id: (was: 309923)

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
>  Issue Type: New Feature
>  Components: extensions-java-sketching, sdk-java-core
>Reporter: Yueyang Qiu
>Assignee: Yueyang Qiu
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 31h 50m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-09-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=309918&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-309918
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 10/Sep/19 15:57
Start Date: 10/Sep/19 15:57
Worklog Time Spent: 10m 
  Work Description: zfraa commented on pull request #9519: [BEAM-7013] Use 
a 0-length byte array to represent empty sketch in HllCount
URL: https://github.com/apache/beam/pull/9519#discussion_r322662678
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCount.java
 ##
 @@ -346,7 +354,14 @@ private Extract() {}
   @ProcessElement
 
 Review comment:
   (Comment belongs further up, but I can't comment on the unchanged lines:)
   Would mention the corner case of empty aggregations somewhere in the 
'Extract' javadoc as well. E.g., "When extracting from an empty aggregation 
(i.e., a byte array of length 0), the result returned is 0." 
 



Issue Time Tracking
---

Worklog Id: (was: 309918)
Time Spent: 31h 40m  (was: 31.5h)

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
>  Issue Type: New Feature
>  Components: extensions-java-sketching, sdk-java-core
>Reporter: Yueyang Qiu
>Assignee: Yueyang Qiu
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 31h 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-09-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=309667&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-309667
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 10/Sep/19 09:47
Start Date: 10/Sep/19 09:47
Worklog Time Spent: 10m 
  Work Description: zfraa commented on pull request #9519: [BEAM-7013] Use 
a 0-length byte array to represent empty sketch in HllCount
URL: https://github.com/apache/beam/pull/9519#discussion_r322649846
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCountMergePartialFn.java
 ##
 @@ -94,9 +99,9 @@ private HllCountMergePartialFn() {}
   @Override
   public byte[] extractOutput(@Nullable HyperLogLogPlusPlus accumulator) {
 if (accumulator == null) {
-  throw new IllegalStateException(
-  "HllCountMergePartialFn.extractOutput() should not be called on a null accumulator.");
+  return new byte[0];
 
 Review comment:
   Got it, thanks! 
 



Issue Time Tracking
---

Worklog Id: (was: 309667)
Time Spent: 31.5h  (was: 31h 20m)

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
>  Issue Type: New Feature
>  Components: extensions-java-sketching, sdk-java-core
>Reporter: Yueyang Qiu
>Assignee: Yueyang Qiu
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 31.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-09-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=309660&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-309660
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 10/Sep/19 09:31
Start Date: 10/Sep/19 09:31
Worklog Time Spent: 10m 
  Work Description: zfraa commented on pull request #9519: [BEAM-7013] Use 
a 0-length byte array to represent empty sketch in HllCount
URL: https://github.com/apache/beam/pull/9519#discussion_r322642519
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCountMergePartialFn.java
 ##
 @@ -54,10 +54,15 @@ private HllCountMergePartialFn() {}
 return null;
   }
 
+  @Nullable
   @Override
   public HyperLogLogPlusPlus addInput(
       @Nullable HyperLogLogPlusPlus accumulator, byte[] input) {
-    if (accumulator == null) {
+    if (input == null) {
+      throw new NullPointerException("Null is not a valid sketch.");
+    } else if (input.length == 0) {
+      return accumulator;
+    } else if (accumulator == null) {
 
 Review comment:
   Right, sorry for missing that :) (Can be resolved). 
 



Issue Time Tracking
---

Worklog Id: (was: 309660)
Time Spent: 31h 20m  (was: 31h 10m)

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
>  Issue Type: New Feature
>  Components: extensions-java-sketching, sdk-java-core
>Reporter: Yueyang Qiu
>Assignee: Yueyang Qiu
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 31h 20m
>  Remaining Estimate: 0h
>






[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-09-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=309407=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-309407
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 10/Sep/19 00:31
Start Date: 10/Sep/19 00:31
Worklog Time Spent: 10m 
  Work Description: boyuanzz commented on pull request #9519: [BEAM-7013] 
Use a 0-length byte array to represent empty sketch in HllCount
URL: https://github.com/apache/beam/pull/9519#discussion_r322504225
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCountMergePartialFn.java
 ##
 @@ -54,10 +54,16 @@ private HllCountMergePartialFn() {}
 return null;
   }
 
+  @Nullable
   @Override
   public HyperLogLogPlusPlus addInput(
       @Nullable HyperLogLogPlusPlus accumulator, byte[] input) {
-    if (accumulator == null) {
+    if (input == null) {
+      throw new NullPointerException(
 
 Review comment:
   Consider using `checkArgument`?
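
For context on the suggestion above, here is a minimal sketch of the two styles of null check. The `checkArgument` helper is a local stand-in written for illustration and assumed to behave like Guava's `Preconditions.checkArgument(boolean, Object)`; the class and method names are hypothetical and this is not the Beam code.

```java
final class NullSketchChecks {
  // Local stand-in, assumed to mirror Guava's Preconditions.checkArgument(boolean, Object).
  static void checkArgument(boolean expression, Object errorMessage) {
    if (!expression) {
      throw new IllegalArgumentException(String.valueOf(errorMessage));
    }
  }

  // Style used in the diff: an explicit NullPointerException.
  static byte[] requireSketchExplicit(byte[] input) {
    if (input == null) {
      throw new NullPointerException("Null is not a valid sketch.");
    }
    return input;
  }

  // Style suggested in the review: a precondition helper.
  static byte[] requireSketchChecked(byte[] input) {
    checkArgument(input != null, "Null is not a valid sketch.");
    return input;
  }
}
```

One difference worth noting: the precondition style reports the problem as an `IllegalArgumentException` rather than a `NullPointerException`, so callers that catch one will not catch the other.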
 



Issue Time Tracking
---

Worklog Id: (was: 309407)
Time Spent: 31h  (was: 30h 50m)

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
>  Issue Type: New Feature
>  Components: extensions-java-sketching, sdk-java-core
>Reporter: Yueyang Qiu
>Assignee: Yueyang Qiu
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 31h
>  Remaining Estimate: 0h
>






[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-09-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=309408=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-309408
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 10/Sep/19 00:31
Start Date: 10/Sep/19 00:31
Worklog Time Spent: 10m 
  Work Description: boyuanzz commented on pull request #9519: [BEAM-7013] 
Use a 0-length byte array to represent empty sketch in HllCount
URL: https://github.com/apache/beam/pull/9519#discussion_r322505023
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/HllCountTest.java
 ##
 @@ -281,12 +283,12 @@ public void testMergePartialGlobally() {
 
   @Test
   @Category(NeedsRunner.class)
-  public void testMergePartialGlobally_MergeWithSketchForEmptySet() {
+  public void testMergePartialGlobally_EmptyInput() {
 
 Review comment:
   Hmm, I don't think an underscore (`_`) is the naming convention for anything 
other than constants. Maybe consider "With" instead?
 



Issue Time Tracking
---

Worklog Id: (was: 309408)
Time Spent: 31h 10m  (was: 31h)

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
>  Issue Type: New Feature
>  Components: extensions-java-sketching, sdk-java-core
>Reporter: Yueyang Qiu
>Assignee: Yueyang Qiu
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 31h 10m
>  Remaining Estimate: 0h
>






[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-09-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=309265=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-309265
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 09/Sep/19 21:30
Start Date: 09/Sep/19 21:30
Worklog Time Spent: 10m 
  Work Description: robinyqiu commented on issue #9519: [BEAM-7013] Use a 
0-length byte array to represent empty sketch in HllCount
URL: https://github.com/apache/beam/pull/9519#issuecomment-529675162
 
 
   Hi @zfraa, I think the test cases you mentioned above are already covered!
 



Issue Time Tracking
---

Worklog Id: (was: 309265)
Time Spent: 30h 50m  (was: 30h 40m)

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
>  Issue Type: New Feature
>  Components: extensions-java-sketching, sdk-java-core
>Reporter: Yueyang Qiu
>Assignee: Yueyang Qiu
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 30h 50m
>  Remaining Estimate: 0h
>






[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-09-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=309259=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-309259
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 09/Sep/19 21:19
Start Date: 09/Sep/19 21:19
Worklog Time Spent: 10m 
  Work Description: robinyqiu commented on pull request #9519: [BEAM-7013] 
Use a 0-length byte array to represent empty sketch in HllCount
URL: https://github.com/apache/beam/pull/9519#discussion_r322440348
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCountMergePartialFn.java
 ##
 @@ -54,10 +54,15 @@ private HllCountMergePartialFn() {}
 return null;
   }
 
+  @Nullable
   @Override
   public HyperLogLogPlusPlus addInput(
   @Nullable HyperLogLogPlusPlus accumulator, byte[] input) {
 
 Review comment:
   The `@Nullable` annotation is on the `accumulator` parameter, which can be 
null and we are handling that properly without throwing an exception.
   
   > tl;dr: why not handle nulls instead of throwing?
   
   There are pros and cons to supporting nulls:
   * The pro is that we can save users from exceptions, as you mentioned.
   * The cons are 1) we would then have two different representations for an 
"empty sketch"; and 2) I feel that if we accept nulls as input, we are 
encouraging users to produce nullable output (and use `NullableCoder`) from 
their upstream transforms, which is more costly in terms of encoding/decoding 
and more error-prone.
   
   Currently I slightly prefer not accepting nulls as "empty sketches". What is 
your opinion? One good thing about keeping the implementation as it is: we can 
always change it later to support nulls, and that change will be backwards 
compatible (but we cannot go the other way).
   
 



Issue Time Tracking
---

Worklog Id: (was: 309259)
Time Spent: 30h 40m  (was: 30.5h)

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
>  Issue Type: New Feature
>  Components: extensions-java-sketching, sdk-java-core
>Reporter: Yueyang Qiu
>Assignee: Yueyang Qiu
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 30h 40m
>  Remaining Estimate: 0h
>






[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-09-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=309253=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-309253
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 09/Sep/19 21:19
Start Date: 09/Sep/19 21:19
Worklog Time Spent: 10m 
  Work Description: robinyqiu commented on pull request #9519: [BEAM-7013] 
Use a 0-length byte array to represent empty sketch in HllCount
URL: https://github.com/apache/beam/pull/9519#discussion_r322341428
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCountMergePartialFn.java
 ##
 @@ -54,10 +54,15 @@ private HllCountMergePartialFn() {}
 return null;
   }
 
+  @Nullable
   @Override
   public HyperLogLogPlusPlus addInput(
       @Nullable HyperLogLogPlusPlus accumulator, byte[] input) {
-    if (accumulator == null) {
+    if (input == null) {
+      throw new NullPointerException("Null is not a valid sketch.");
+    } else if (input.length == 0) {
+      return accumulator;
+    } else if (accumulator == null) {
 
 Review comment:
   The `accumulator` is of type `HyperLogLogPlusPlus`, not `byte[]`, so 
I don't think we can do that here. (And we cannot use `byte[]` for the 
accumulator because of the cost of serialization/deserialization.)
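
To illustrate the point about keeping an object accumulator while the inputs are byte arrays, here is a minimal sketch of the branch order from the diff. It assumes a `Set<Long>` as a stand-in for `HyperLogLogPlusPlus` and a hypothetical 8-bytes-per-value encoding as a stand-in for a serialized sketch; the last two branches are an assumption about how the method continues, since the hunk does not show them.

```java
import java.nio.ByteBuffer;
import java.util.HashSet;
import java.util.Set;

final class MergePartialSketch {
  // Stand-in for deserializing a sketch: decodes consecutive 8-byte longs into a set.
  static Set<Long> decode(byte[] bytes) {
    Set<Long> values = new HashSet<>();
    ByteBuffer buffer = ByteBuffer.wrap(bytes);
    while (buffer.remaining() >= Long.BYTES) {
      values.add(buffer.getLong());
    }
    return values;
  }

  // Same branch order as the diff; a null accumulator means "no input seen yet".
  static Set<Long> addInput(Set<Long> accumulator, byte[] input) {
    if (input == null) {
      throw new NullPointerException("Null is not a valid sketch.");
    } else if (input.length == 0) {
      return accumulator; // an empty sketch contributes nothing
    } else if (accumulator == null) {
      return decode(input); // assumed: first non-empty input becomes the accumulator
    } else {
      accumulator.addAll(decode(input)); // assumed: stand-in for merging into the accumulator
      return accumulator;
    }
  }
}
```

In this sketch each input is decoded once and merged into a live object, whereas a `byte[]` accumulator would have to be decoded and re-encoded on every `addInput` call.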
 



Issue Time Tracking
---

Worklog Id: (was: 309253)
Time Spent: 30h 20m  (was: 30h 10m)

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
>  Issue Type: New Feature
>  Components: extensions-java-sketching, sdk-java-core
>Reporter: Yueyang Qiu
>Assignee: Yueyang Qiu
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 30h 20m
>  Remaining Estimate: 0h
>






[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-09-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=309257=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-309257
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 09/Sep/19 21:19
Start Date: 09/Sep/19 21:19
Worklog Time Spent: 10m 
  Work Description: robinyqiu commented on pull request #9519: [BEAM-7013] 
Use a 0-length byte array to represent empty sketch in HllCount
URL: https://github.com/apache/beam/pull/9519#discussion_r322397926
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCountMergePartialFn.java
 ##
 @@ -54,10 +54,15 @@ private HllCountMergePartialFn() {}
 return null;
   }
 
+  @Nullable
 
 Review comment:
   The return value of this function is an accumulator of type 
`HyperLogLogPlusPlus` and it can be `null`. This annotation was missing 
from the last PR so I am adding it back here.
 



Issue Time Tracking
---

Worklog Id: (was: 309257)
Time Spent: 30.5h  (was: 30h 20m)

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
>  Issue Type: New Feature
>  Components: extensions-java-sketching, sdk-java-core
>Reporter: Yueyang Qiu
>Assignee: Yueyang Qiu
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 30.5h
>  Remaining Estimate: 0h
>






[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-09-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=309254=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-309254
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 09/Sep/19 21:19
Start Date: 09/Sep/19 21:19
Worklog Time Spent: 10m 
  Work Description: robinyqiu commented on pull request #9519: [BEAM-7013] 
Use a 0-length byte array to represent empty sketch in HllCount
URL: https://github.com/apache/beam/pull/9519#discussion_r322342971
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/HllCountTest.java
 ##
 @@ -86,13 +88,6 @@
 LONGS_SKETCH = hll.serializeToByteArray();
   }
 
-  private static final byte[] LONGS_EMPTY_SKETCH;
-
-  static {
-    HyperLogLogPlusPlus hll = new HyperLogLogPlusPlus.Builder().buildForLongs();
-    LONGS_EMPTY_SKETCH = hll.serializeToByteArray();
-  }
-
 
 Review comment:
   The current implementation can handle this case. I agree that keeping tests 
for this case is valuable, so I have added it back and added a couple more. :)
 



Issue Time Tracking
---

Worklog Id: (was: 309254)
Time Spent: 30h 20m  (was: 30h 10m)

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
>  Issue Type: New Feature
>  Components: extensions-java-sketching, sdk-java-core
>Reporter: Yueyang Qiu
>Assignee: Yueyang Qiu
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 30h 20m
>  Remaining Estimate: 0h
>






[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-09-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=309256=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-309256
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 09/Sep/19 21:19
Start Date: 09/Sep/19 21:19
Worklog Time Spent: 10m 
  Work Description: robinyqiu commented on pull request #9519: [BEAM-7013] 
Use a 0-length byte array to represent empty sketch in HllCount
URL: https://github.com/apache/beam/pull/9519#discussion_r322400143
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCountMergePartialFn.java
 ##
 @@ -94,9 +99,9 @@ private HllCountMergePartialFn() {}
   @Override
   public byte[] extractOutput(@Nullable HyperLogLogPlusPlus accumulator) {
     if (accumulator == null) {
-      throw new IllegalStateException(
-          "HllCountMergePartialFn.extractOutput() should not be called on a null accumulator.");
+      return new byte[0];
 
 Review comment:
   The reason is that the superclass `CombineFn` has a default 
[implementation](https://github.com/robinyqiu/beam/blob/3b6a628c9ad0fbf63b7c1f7d355dbc8cf5219eb2/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/Combine.java#L455-L458):
   ```
   @Override
   public OutputT defaultValue() {
     return extractOutput(createAccumulator());
   }
   ```
   
   For `MergePartial`, this implementation already gives the correct result: 
`createAccumulator()` returns `null` and `extractOutput(null)` returns `new 
byte[0]`.
   
   But for `Init`, the output of this implementation will be a valid empty 
sketch for each type (e.g. `LONGS_PROTO_OF_EMPTY_SKETCH` as you mentioned in 
the next comment), which is not what we want, so we need to override it there.
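
The interaction described above can be sketched with a tiny stand-in for `CombineFn`. Everything here is hypothetical and for illustration only: `MiniCombineFn` keeps just the three methods under discussion, a `List<Long>` stands in for the sketch object, and `serialize` is an invented encoding (one header byte plus 8 bytes per value) chosen so that a valid empty sketch is not zero bytes long.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

final class DefaultValueSketch {
  // Minimal stand-in for CombineFn, with the same default as the snippet quoted above.
  abstract static class MiniCombineFn<AccumT, OutputT> {
    abstract AccumT createAccumulator();

    abstract OutputT extractOutput(AccumT accumulator);

    OutputT defaultValue() {
      return extractOutput(createAccumulator());
    }
  }

  // Hypothetical encoding: one header byte, then 8 bytes per value.
  static byte[] serialize(List<Long> values) {
    ByteBuffer buffer = ByteBuffer.allocate(1 + Long.BYTES * values.size());
    buffer.put((byte) 1);
    for (long value : values) {
      buffer.putLong(value);
    }
    return buffer.array();
  }

  // MergePartial-like: a null accumulator already yields byte[0] by default.
  static class MergePartialLike extends MiniCombineFn<List<Long>, byte[]> {
    @Override
    List<Long> createAccumulator() {
      return null;
    }

    @Override
    byte[] extractOutput(List<Long> accumulator) {
      return accumulator == null ? new byte[0] : serialize(accumulator);
    }
  }

  // Init-like: the default would be a serialized, non-empty "empty sketch".
  static class InitLike extends MiniCombineFn<List<Long>, byte[]> {
    @Override
    List<Long> createAccumulator() {
      return new ArrayList<>();
    }

    @Override
    byte[] extractOutput(List<Long> accumulator) {
      return serialize(accumulator);
    }
  }

  // Init-like with the override from the PR, so the default is byte[0] as well.
  static class InitLikeWithOverride extends InitLike {
    @Override
    byte[] defaultValue() {
      return new byte[0];
    }
  }
}
```

Under these assumptions, only the Init-like function needs the override: its accumulator is never null, so the inherited default would serialize an empty sketch instead of returning a zero-length array.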
 



Issue Time Tracking
---

Worklog Id: (was: 309256)
Time Spent: 30.5h  (was: 30h 20m)

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
>  Issue Type: New Feature
>  Components: extensions-java-sketching, sdk-java-core
>Reporter: Yueyang Qiu
>Assignee: Yueyang Qiu
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 30.5h
>  Remaining Estimate: 0h
>






[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-09-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=309258=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-309258
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 09/Sep/19 21:19
Start Date: 09/Sep/19 21:19
Worklog Time Spent: 10m 
  Work Description: robinyqiu commented on pull request #9519: [BEAM-7013] 
Use a 0-length byte array to represent empty sketch in HllCount
URL: https://github.com/apache/beam/pull/9519#discussion_r322459772
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCountInitFn.java
 ##
 @@ -57,6 +57,11 @@ void setPrecision(int precision) {
 this.precision = precision;
   }
 
+  @Override
+  public byte[] defaultValue() {
+return new byte[0];
 
 Review comment:
   Thanks for the suggestion. Done!
 



Issue Time Tracking
---

Worklog Id: (was: 309258)
Time Spent: 30h 40m  (was: 30.5h)

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
>  Issue Type: New Feature
>  Components: extensions-java-sketching, sdk-java-core
>Reporter: Yueyang Qiu
>Assignee: Yueyang Qiu
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 30h 40m
>  Remaining Estimate: 0h
>






[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-09-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=309255=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-309255
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 09/Sep/19 21:19
Start Date: 09/Sep/19 21:19
Worklog Time Spent: 10m 
  Work Description: robinyqiu commented on pull request #9519: [BEAM-7013] 
Use a 0-length byte array to represent empty sketch in HllCount
URL: https://github.com/apache/beam/pull/9519#discussion_r322446265
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCount.java
 ##
 @@ -346,7 +348,13 @@ private Extract() {}
   @ProcessElement
   public void processElement(
   @Element byte[] sketch, OutputReceiver receiver) {
-receiver.output(HyperLogLogPlusPlus.forProto(sketch).result());
+if (sketch == null) {
+  throw new NullPointerException("Null is not a valid sketch.");
 
 Review comment:
   This error message is much clearer! Done! (Replied to your comment about 
supporting null in the next thread.)
 



Issue Time Tracking
---

Worklog Id: (was: 309255)
Time Spent: 30.5h  (was: 30h 20m)

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
>  Issue Type: New Feature
>  Components: extensions-java-sketching, sdk-java-core
>Reporter: Yueyang Qiu
>Assignee: Yueyang Qiu
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 30.5h
>  Remaining Estimate: 0h
>






[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-09-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=309147=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-309147
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 09/Sep/19 18:43
Start Date: 09/Sep/19 18:43
Worklog Time Spent: 10m 
  Work Description: robinyqiu commented on pull request #9447: [WIP] 
[BEAM-7013] Handle null input in HllCount.MergePartial and HllCount.Extract
URL: https://github.com/apache/beam/pull/9447
 
 
   
 



Issue Time Tracking
---

Worklog Id: (was: 309147)
Time Spent: 30h 10m  (was: 30h)

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
>  Issue Type: New Feature
>  Components: extensions-java-sketching, sdk-java-core
>Reporter: Yueyang Qiu
>Assignee: Yueyang Qiu
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 30h 10m
>  Remaining Estimate: 0h
>






[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-09-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=309036=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-309036
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 09/Sep/19 16:22
Start Date: 09/Sep/19 16:22
Worklog Time Spent: 10m 
  Work Description: zfraa commented on pull request #9519: [BEAM-7013] Use 
a 0-length byte array to represent empty sketch in HllCount
URL: https://github.com/apache/beam/pull/9519#discussion_r322251413
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCount.java
 ##
 @@ -346,7 +348,13 @@ private Extract() {}
   @ProcessElement
   public void processElement(
   @Element byte[] sketch, OutputReceiver receiver) {
-receiver.output(HyperLogLogPlusPlus.forProto(sketch).result());
+if (sketch == null) {
+  throw new NullPointerException("Null is not a valid sketch.");
 
 Review comment:
   [Edit: see comment about @Nullable below first before making any changes to 
error messaging]
   How about (for the error message): "Expected a valid sketch or an empty byte 
array (for empty sketches), but found null"? 
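
A minimal sketch of how the proposed message might read in the three-way check. It assumes that a zero-length array extracts to a count of 0 (the hunk above does not show that branch) and takes the decoder as a parameter, as a hypothetical stand-in for `HyperLogLogPlusPlus.forProto(sketch).result()`.

```java
import java.util.function.ToLongFunction;

final class ExtractSketch {
  // decoder is a stand-in for deserializing a sketch and reading its estimate.
  static long extract(byte[] sketch, ToLongFunction<byte[]> decoder) {
    if (sketch == null) {
      throw new NullPointerException(
          "Expected a valid sketch or an empty byte array (for empty sketches), but found null");
    } else if (sketch.length == 0) {
      return 0L; // assumed: an empty sketch has seen no elements
    } else {
      return decoder.applyAsLong(sketch);
    }
  }
}
```

The longer message tells the caller both accepted inputs in one place, which the shorter "Null is not a valid sketch." does not.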
 



Issue Time Tracking
---

Worklog Id: (was: 309036)

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
>  Issue Type: New Feature
>  Components: extensions-java-sketching, sdk-java-core
>Reporter: Yueyang Qiu
>Assignee: Yueyang Qiu
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 30h
>  Remaining Estimate: 0h
>






[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-09-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=309035=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-309035
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 09/Sep/19 16:22
Start Date: 09/Sep/19 16:22
Worklog Time Spent: 10m 
  Work Description: zfraa commented on pull request #9519: [BEAM-7013] Use 
a 0-length byte array to represent empty sketch in HllCount
URL: https://github.com/apache/beam/pull/9519#discussion_r322333589
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCountMergePartialFn.java
 ##
 @@ -94,9 +99,9 @@ private HllCountMergePartialFn() {}
   @Override
   public byte[] extractOutput(@Nullable HyperLogLogPlusPlus accumulator) {
     if (accumulator == null) {
-      throw new IllegalStateException(
-          "HllCountMergePartialFn.extractOutput() should not be called on a null accumulator.");
+      return new byte[0];
 
 Review comment:
   We don't need a defaultValue() implementation in this class because the 
accumulator is different and can encode "we haven't seen any input yet..." -- 
correct? (Ideally, we'd have more symmetry between Init and MergePartial, but 
that's a detail). 
 



Issue Time Tracking
---

Worklog Id: (was: 309035)

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
>  Issue Type: New Feature
>  Components: extensions-java-sketching, sdk-java-core
>Reporter: Yueyang Qiu
>Assignee: Yueyang Qiu
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 30h
>  Remaining Estimate: 0h
>






[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-09-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=309034=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-309034
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 09/Sep/19 16:22
Start Date: 09/Sep/19 16:22
Worklog Time Spent: 10m 
  Work Description: zfraa commented on pull request #9519: [BEAM-7013] Use 
a 0-length byte array to represent empty sketch in HllCount
URL: https://github.com/apache/beam/pull/9519#discussion_r322267553
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCountMergePartialFn.java
 ##
 @@ -54,10 +54,15 @@ private HllCountMergePartialFn() {}
 return null;
   }
 
+  @Nullable
   @Override
   public HyperLogLogPlusPlus addInput(
       @Nullable HyperLogLogPlusPlus accumulator, byte[] input) {
-    if (accumulator == null) {
+    if (input == null) {
+      throw new NullPointerException("Null is not a valid sketch.");
+    } else if (input.length == 0) {
+      return accumulator;
+    } else if (accumulator == null) {
 
 Review comment:
   Did you consider changing the "empty accumulator" representation from null 
to byte[0] as well? Pros/cons? 
   (It might be conceptually easier to have just one representation for empty 
sketches, the same internally and externally; but I don't see any large 
benefit to making the change, so...) 
 



Issue Time Tracking
---

Worklog Id: (was: 309034)

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
>  Issue Type: New Feature
>  Components: extensions-java-sketching, sdk-java-core
>Reporter: Yueyang Qiu
>Assignee: Yueyang Qiu
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 30h
>  Remaining Estimate: 0h
>






[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-09-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=309032=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-309032
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 09/Sep/19 16:22
Start Date: 09/Sep/19 16:22
Worklog Time Spent: 10m 
  Work Description: zfraa commented on pull request #9519: [BEAM-7013] Use 
a 0-length byte array to represent empty sketch in HllCount
URL: https://github.com/apache/beam/pull/9519#discussion_r322268995
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/HllCountTest.java
 ##
 @@ -86,13 +88,6 @@
 LONGS_SKETCH = hll.serializeToByteArray();
   }
 
-  private static final byte[] LONGS_EMPTY_SKETCH;
-
-  static {
-    HyperLogLogPlusPlus hll = new HyperLogLogPlusPlus.Builder().buildForLongs();
-    LONGS_EMPTY_SKETCH = hll.serializeToByteArray();
-  }
-
 
 Review comment:
   We still need to handle this case as well! I.e., some empty sketches will be 
represented as byte[0], but not all. => I would keep a few tests with this 
constant, maybe under a more specific name (LONGS_PROTO_OF_EMPTY_SKETCH, and 
the one above ZERO_BYTES_EMPTY_SKETCH..?)
 



Issue Time Tracking
---

Worklog Id: (was: 309032)

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
>  Issue Type: New Feature
>  Components: extensions-java-sketching, sdk-java-core
>Reporter: Yueyang Qiu
>Assignee: Yueyang Qiu
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 30h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-09-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=309030=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-309030
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 09/Sep/19 16:22
Start Date: 09/Sep/19 16:22
Worklog Time Spent: 10m 
  Work Description: zfraa commented on pull request #9519: [BEAM-7013] Use 
a 0-length byte array to represent empty sketch in HllCount
URL: https://github.com/apache/beam/pull/9519#discussion_r322250509
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCount.java
 ##
 @@ -346,7 +348,13 @@ private Extract() {}
   @ProcessElement
   public void processElement(
       @Element byte[] sketch, OutputReceiver<Long> receiver) {
-    receiver.output(HyperLogLogPlusPlus.forProto(sketch).result());
+    if (sketch == null) {
+      throw new NullPointerException("Null is not a valid sketch.");
+    } else if (sketch.length == 0) {
+      receiver.output(0L);
 
 Review comment:
   Yay, nice! :D
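
   For readers following the thread, the branch structure in the hunk above 
can be sketched in isolation. This is only an illustration: the class name 
ExtractSketch and the decoder parameter are hypothetical, the decoder stands in 
for HyperLogLogPlusPlus.forProto(sketch).result(), and the final else branch is 
an assumption because the hunk is cut off before it.

```java
import java.util.function.ToLongFunction;

/** Hypothetical helper mirroring the null / empty / non-empty branches under review. */
final class ExtractSketch {
  private ExtractSketch() {}

  /**
   * @param sketch serialized sketch; a zero-length array means "empty aggregation"
   * @param decoder stand-in (assumption) for HyperLogLogPlusPlus.forProto(sketch).result()
   */
  static long extract(byte[] sketch, ToLongFunction<byte[]> decoder) {
    if (sketch == null) {
      // Null is rejected outright rather than silently treated as empty.
      throw new NullPointerException("Null is not a valid sketch.");
    } else if (sketch.length == 0) {
      // Empty sketch: nothing was aggregated, so the distinct count is 0.
      return 0L;
    } else {
      // Assumed remaining branch: decode the real sketch and read its estimate.
      return decoder.applyAsLong(sketch);
    }
  }
}
```

   The point of the zero-length branch is that no sketch has to be parsed (and 
no element type has to be known) to answer "0".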
 



Issue Time Tracking
---

Worklog Id: (was: 309030)
Time Spent: 30h  (was: 29h 50m)

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
>  Issue Type: New Feature
>  Components: extensions-java-sketching, sdk-java-core
>Reporter: Yueyang Qiu
>Assignee: Yueyang Qiu
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 30h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-09-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=309031=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-309031
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 09/Sep/19 16:22
Start Date: 09/Sep/19 16:22
Worklog Time Spent: 10m 
  Work Description: zfraa commented on pull request #9519: [BEAM-7013] Use 
a 0-length byte array to represent empty sketch in HllCount
URL: https://github.com/apache/beam/pull/9519#discussion_r322264216
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCountMergePartialFn.java
 ##
 @@ -54,10 +54,15 @@ private HllCountMergePartialFn() {}
     return null;
   }
 
+  @Nullable
   @Override
   public HyperLogLogPlusPlus addInput(
       @Nullable HyperLogLogPlusPlus accumulator, byte[] input) {
 
 Review comment:
   tl;dr: why not handle nulls instead of throwing? 
   
   I didn't find any sources on the exact implied semantics of @Nullable, but I 
would tend to assume that if a parameter is annotated with @Nullable, the 
method handles an actual null benignly instead of throwing an exception.
   I would do one of the following two things: 
   - Either remove the @Nullable annotation and keep throwing below (again, 
I don't feel strongly about this); 
   - Or -- I think we can safely assume that if a null is passed, it's supposed 
to be an empty sketch: maybe a BQ sketch that made it through importing without 
conversion, ... . We have the means to support this smoothly by just treating 
nulls like byte[0] -- why not do this and save the users some exceptions? 
   (This would need to be done consistently across methods.)
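
   The second option (treat null like byte[0]) could look roughly like the 
sketch below. The class and method names are hypothetical and not part of the 
PR; the idea is just to normalize once at the boundary so that every method 
sees the same "empty" representation.

```java
/** Hypothetical helper: normalizes null sketches to the zero-length representation. */
final class SketchInputs {
  // Zero-length arrays have no mutable contents, so one instance can be shared.
  private static final byte[] EMPTY = new byte[0];

  private SketchInputs() {}

  /** Returns the input unchanged, or the shared empty array if the input is null. */
  static byte[] emptyIfNull(byte[] sketch) {
    return sketch == null ? EMPTY : sketch;
  }

  /** True for both null and byte[0], i.e. for every "nothing aggregated" input. */
  static boolean isEmpty(byte[] sketch) {
    return emptyIfNull(sketch).length == 0;
  }
}
```

   Calling such a helper first in each method would be one way to get the 
consistency asked for above.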
 



Issue Time Tracking
---

Worklog Id: (was: 309031)
Time Spent: 30h  (was: 29h 50m)

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
>  Issue Type: New Feature
>  Components: extensions-java-sketching, sdk-java-core
>Reporter: Yueyang Qiu
>Assignee: Yueyang Qiu
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 30h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-09-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=309029=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-309029
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 09/Sep/19 16:22
Start Date: 09/Sep/19 16:22
Worklog Time Spent: 10m 
  Work Description: zfraa commented on pull request #9519: [BEAM-7013] Use 
a 0-length byte array to represent empty sketch in HllCount
URL: https://github.com/apache/beam/pull/9519#discussion_r322257693
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCountMergePartialFn.java
 ##
 @@ -54,10 +54,15 @@ private HllCountMergePartialFn() {}
 return null;
   }
 
+  @Nullable
 
 Review comment:
   Why nullable if we never return null? Might be from a previous version of 
this PR? 
 



Issue Time Tracking
---

Worklog Id: (was: 309029)
Time Spent: 29h 50m  (was: 29h 40m)

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
>  Issue Type: New Feature
>  Components: extensions-java-sketching, sdk-java-core
>Reporter: Yueyang Qiu
>Assignee: Yueyang Qiu
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 29h 50m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-09-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=309033=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-309033
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 09/Sep/19 16:22
Start Date: 09/Sep/19 16:22
Worklog Time Spent: 10m 
  Work Description: zfraa commented on pull request #9519: [BEAM-7013] Use 
a 0-length byte array to represent empty sketch in HllCount
URL: https://github.com/apache/beam/pull/9519#discussion_r322257094
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCountInitFn.java
 ##
 @@ -57,6 +57,11 @@ void setPrecision(int precision) {
     this.precision = precision;
   }
 
+  @Override
+  public byte[] defaultValue() {
+    return new byte[0];
 
 Review comment:
   Maybe add an implementation comment, something like: 
   "An empty aggregation is represented by an empty byte[], since we cannot 
create sketches without knowing their type. An empty byte[] is space-efficient 
and safer than null. As opposed to returning an empty PCollection, it allows 
us to return '0' when extracting from the sketch." 
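
   To illustrate why a zero-length array works as the "empty aggregation" 
value, here is a small sketch of a merge in which byte[0] acts as the identity. 
It is not the PR's code: MergePartialSketch and the merge parameter are 
hypothetical, and merge stands in for whatever actually combines two real 
sketches.

```java
import java.util.List;
import java.util.function.BinaryOperator;

/** Hypothetical illustration: byte[0] is the identity element when merging sketches. */
final class MergePartialSketch {
  private MergePartialSketch() {}

  /**
   * @param sketches serialized sketches; zero-length entries mean "empty" (nulls not allowed)
   * @param merge stand-in (assumption) for merging two non-empty serialized sketches
   * @return the merged sketch, or byte[0] if no non-empty sketch was seen
   */
  static byte[] mergeAll(List<byte[]> sketches, BinaryOperator<byte[]> merge) {
    byte[] accumulator = null; // internal "nothing seen yet" marker
    for (byte[] sketch : sketches) {
      if (sketch.length == 0) {
        continue; // an empty sketch contributes nothing to the result
      }
      accumulator = (accumulator == null) ? sketch : merge.apply(accumulator, sketch);
    }
    // Externally, "nothing seen" is reported as byte[0], never as null.
    return accumulator == null ? new byte[0] : accumulator;
  }
}
```

   Note that the sketch keeps null as the internal accumulator marker and only 
uses byte[0] at the boundary, which is the split discussed in this review.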
 



Issue Time Tracking
---

Worklog Id: (was: 309033)

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
>  Issue Type: New Feature
>  Components: extensions-java-sketching, sdk-java-core
>Reporter: Yueyang Qiu
>Assignee: Yueyang Qiu
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 30h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-09-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=308657=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-308657
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 09/Sep/19 07:17
Start Date: 09/Sep/19 07:17
Worklog Time Spent: 10m 
  Work Description: robinyqiu commented on pull request #9519: [BEAM-7013] 
Use a 0-length byte array to represent empty sketch in HllCount
URL: https://github.com/apache/beam/pull/9519
 
 
   r: @zfraa
   
   
   

[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-08-30 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=304309=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-304309
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 30/Aug/19 15:20
Start Date: 30/Aug/19 15:20
Worklog Time Spent: 10m 
  Work Description: zfraa commented on pull request #9447: [WIP] 
[BEAM-7013] Handle null input in HllCount.MergePartial and HllCount.Extract
URL: https://github.com/apache/beam/pull/9447#discussion_r319555345
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCount.java
 ##
 @@ -346,7 +346,9 @@ private Extract() {}
   @ProcessElement
   public void processElement(
       @Element byte[] sketch, OutputReceiver<Long> receiver) {
-    receiver.output(HyperLogLogPlusPlus.forProto(sketch).result());
+    Long result =
+        (sketch == null) ? 0L : HyperLogLogPlusPlus.forProto(sketch).result();
 
 Review comment:
   Would it make sense to document the nullability of the sketch param 
somewhere -- in the method doc or with a @Nullable annotation on the param? 
 



Issue Time Tracking
---

Worklog Id: (was: 304309)
Time Spent: 29h 10m  (was: 29h)

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
>  Issue Type: New Feature
>  Components: extensions-java-sketching, sdk-java-core
>Reporter: Yueyang Qiu
>Assignee: Yueyang Qiu
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 29h 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-08-30 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=304311=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-304311
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 30/Aug/19 15:20
Start Date: 30/Aug/19 15:20
Worklog Time Spent: 10m 
  Work Description: zfraa commented on pull request #9447: [WIP] 
[BEAM-7013] Handle null input in HllCount.MergePartial and HllCount.Extract
URL: https://github.com/apache/beam/pull/9447#discussion_r319557035
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCountMergePartialFn.java
 ##
 @@ -57,6 +57,7 @@ private HllCountMergePartialFn() {}
   @Override
   public HyperLogLogPlusPlus addInput(
       @Nullable HyperLogLogPlusPlus accumulator, byte[] input) {
+    if (input == null) return accumulator;
 
 Review comment:
   Also mark output as nullable now. 
   Can we deal with a null output of addInput(..) throughout? 
 



Issue Time Tracking
---

Worklog Id: (was: 304311)
Time Spent: 29.5h  (was: 29h 20m)

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
>  Issue Type: New Feature
>  Components: extensions-java-sketching, sdk-java-core
>Reporter: Yueyang Qiu
>Assignee: Yueyang Qiu
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 29.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-08-30 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=304310=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-304310
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 30/Aug/19 15:20
Start Date: 30/Aug/19 15:20
Worklog Time Spent: 10m 
  Work Description: zfraa commented on pull request #9447: [WIP] 
[BEAM-7013] Handle null input in HllCount.MergePartial and HllCount.Extract
URL: https://github.com/apache/beam/pull/9447#discussion_r319559575
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCountMergePartialFn.java
 ##
 @@ -94,9 +95,9 @@ private HllCountMergePartialFn() {}
   @Override
   public byte[] extractOutput(@Nullable HyperLogLogPlusPlus accumulator) {
     if (accumulator == null) {
-      throw new IllegalStateException(
-          "HllCountMergePartialFn.extractOutput() should not be called on a null accumulator.");
+      return null;
 
 Review comment:
   same here (annotation + is this handled well everywhere downstream?)
 



Issue Time Tracking
---

Worklog Id: (was: 304310)
Time Spent: 29h 20m  (was: 29h 10m)

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
>  Issue Type: New Feature
>  Components: extensions-java-sketching, sdk-java-core
>Reporter: Yueyang Qiu
>Assignee: Yueyang Qiu
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 29h 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-08-30 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=304305=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-304305
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 30/Aug/19 14:54
Start Date: 30/Aug/19 14:54
Worklog Time Spent: 10m 
  Work Description: zfraa commented on pull request #9144: [BEAM-7013] 
Integrating ZetaSketch's HLL++ algorithm with Beam
URL: https://github.com/apache/beam/pull/9144#discussion_r319549589
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/HllCountTest.java
 ##
 @@ -0,0 +1,373 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.beam.sdk.extensions.zetasketch;
+
+import com.google.zetasketch.HyperLogLogPlusPlus;
+import com.google.zetasketch.shaded.com.google.protobuf.ByteString;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+import org.apache.beam.sdk.Pipeline.PipelineExecutionException;
+import org.apache.beam.sdk.testing.NeedsRunner;
+import org.apache.beam.sdk.testing.PAssert;
+import org.apache.beam.sdk.testing.TestPipeline;
+import org.apache.beam.sdk.transforms.Create;
+import org.apache.beam.sdk.values.KV;
+import org.apache.beam.sdk.values.PCollection;
+import org.apache.beam.sdk.values.TypeDescriptor;
+import org.junit.Rule;
+import org.junit.Test;
+import org.junit.experimental.categories.Category;
+import org.junit.rules.ExpectedException;
+import org.junit.runner.RunWith;
+import org.junit.runners.JUnit4;
+
+/** Tests for {@link HllCount}. */
+@RunWith(JUnit4.class)
+public class HllCountTest {
+
+  @Rule public final transient TestPipeline p = TestPipeline.create();
+  @Rule public transient ExpectedException thrown = ExpectedException.none();
+
+  // Integer
+  private static final List<Integer> INTS1 = Arrays.asList(1, 2, 3, 3, 1, 4);
+  private static final byte[] INTS1_SKETCH;
+  private static final Long INTS1_ESTIMATE;
+
+  static {
+    HyperLogLogPlusPlus hll = new HyperLogLogPlusPlus.Builder().buildForIntegers();
+    INTS1.forEach(hll::add);
+    INTS1_SKETCH = hll.serializeToByteArray();
+    INTS1_ESTIMATE = hll.longResult();
+  }
+
+  private static final List<Integer> INTS2 = Arrays.asList(3, 3, 3, 3);
+  private static final byte[] INTS2_SKETCH;
+  private static final Long INTS2_ESTIMATE;
+
+  static {
+    HyperLogLogPlusPlus hll = new HyperLogLogPlusPlus.Builder().buildForIntegers();
+    INTS2.forEach(hll::add);
+    INTS2_SKETCH = hll.serializeToByteArray();
+    INTS2_ESTIMATE = hll.longResult();
+  }
+
+  private static final byte[] INTS1_INTS2_SKETCH;
+
+  static {
+    HyperLogLogPlusPlus hll = HyperLogLogPlusPlus.forProto(INTS1_SKETCH);
+    hll.merge(INTS2_SKETCH);
+    INTS1_INTS2_SKETCH = hll.serializeToByteArray();
+  }
+
+  // Long
+  private static final List<Long> LONGS = Collections.singletonList(1L);
+  private static final byte[] LONGS_SKETCH;
+
+  static {
+    HyperLogLogPlusPlus hll = new HyperLogLogPlusPlus.Builder().buildForLongs();
+    LONGS.forEach(hll::add);
+    LONGS_SKETCH = hll.serializeToByteArray();
+  }
+
+  private static final byte[] LONGS_EMPTY_SKETCH;
+
+  static {
+    HyperLogLogPlusPlus hll = new HyperLogLogPlusPlus.Builder().buildForLongs();
+    LONGS_EMPTY_SKETCH = hll.serializeToByteArray();
+  }
+
+  // String
+  private static final List<String> STRINGS = Arrays.asList("s1", "s2", "s1", "s2");
+  private static final byte[] STRINGS_SKETCH;
+
+  static {
+    HyperLogLogPlusPlus hll = new HyperLogLogPlusPlus.Builder().buildForStrings();
+    STRINGS.forEach(hll::add);
+    STRINGS_SKETCH = hll.serializeToByteArray();
+  }
+
+  private static final int TEST_PRECISION = 20;
+  private static final byte[] STRINGS_SKETCH_TEST_PRECISION;
+
+  static {
+    HyperLogLogPlusPlus hll =
+        new HyperLogLogPlusPlus.Builder().normalPrecision(TEST_PRECISION).buildForStrings();
+    STRINGS.forEach(hll::add);
+    STRINGS_SKETCH_TEST_PRECISION = hll.serializeToByteArray();
+  }
+
+  // Bytes
+  private static final byte[] BYTES0 = {(byte) 0x1, (byte) 

[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-08-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=302178=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-302178
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 27/Aug/19 16:35
Start Date: 27/Aug/19 16:35
Worklog Time Spent: 10m 
  Work Description: boyuanzz commented on pull request #9144: [BEAM-7013] 
Integrating ZetaSketch's HLL++ algorithm with Beam
URL: https://github.com/apache/beam/pull/9144
 
 
   
 



Issue Time Tracking
---

Worklog Id: (was: 302178)
Time Spent: 28h 50m  (was: 28h 40m)

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
>  Issue Type: New Feature
>  Components: extensions-java-sketching, sdk-java-core
>Reporter: Yueyang Qiu
>Assignee: Yueyang Qiu
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 28h 50m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-08-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=301562=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-301562
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 26/Aug/19 22:37
Start Date: 26/Aug/19 22:37
Worklog Time Spent: 10m 
  Work Description: robinyqiu commented on pull request #9144: [BEAM-7013] 
Integrating ZetaSketch's HLL++ algorithm with Beam
URL: https://github.com/apache/beam/pull/9144#discussion_r317826798
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/BigQueryHllSketchCompatibilityIT.java
 ##
 @@ -49,26 +57,74 @@
 @RunWith(JUnit4.class)
 public class BigQueryHllSketchCompatibilityIT {
 
-  private static final String DATASET_NAME = "zetasketch_compatibility_test";
+  private static final String APP_NAME;
+  private static final String PROJECT_ID;
+  private static final String DATASET_ID;
 
-  // Table for testReadSketchFromBigQuery()
+  private static final List<String> TEST_DATA =
+      Arrays.asList("Apple", "Orange", "Banana", "Orange");
+
+  // Data Table: used by testReadSketchFromBigQuery())
   // Schema: only one STRING field named "data".
   // Content: prepopulated with 4 rows: "Apple", "Orange", "Banana", "Orange"
-  private static final String DATA_TABLE_NAME = "hll_data";
+  private static final String DATA_TABLE_ID = "hll_data";
   private static final String DATA_FIELD_NAME = "data";
+  private static final String DATA_FIELD_TYPE = "STRING";
   private static final String QUERY_RESULT_FIELD_NAME = "sketch";
   private static final Long EXPECTED_COUNT = 3L;
 
-  // Table for testWriteSketchToBigQuery()
+  // Sketch Table: used by testWriteSketchToBigQuery()
   // Schema: only one BYTES field named "sketch".
   // Content: will be overridden by the sketch computed by the test pipeline each time the test runs
-  private static final String SKETCH_TABLE_NAME = "hll_sketch";
+  private static final String SKETCH_TABLE_ID = "hll_sketch";
   private static final String SKETCH_FIELD_NAME = "sketch";
-  private static final List<String> TEST_DATA =
-      Arrays.asList("Apple", "Orange", "Banana", "Orange");
+  private static final String SKETCH_FIELD_TYPE = "BYTES";
   // SHA-1 hash of string "[3]", the string representation of a row that has only one field 3 in it
   private static final String EXPECTED_CHECKSUM = "f1e31df9806ce94c5bdbbfff9608324930f4d3f1";
 
+  static {
+    ApplicationNameOptions options =
+        TestPipeline.testingPipelineOptions().as(ApplicationNameOptions.class);
+    APP_NAME = options.getAppName();
+    PROJECT_ID = options.as(GcpOptions.class).getProject();
+    DATASET_ID = String.format("zetasketch_%tY_%

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
>  Issue Type: New Feature
>  Components: extensions-java-sketching, sdk-java-core
>Reporter: Yueyang Qiu
>Assignee: Yueyang Qiu
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 28h 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-08-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=301528=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-301528
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 26/Aug/19 21:35
Start Date: 26/Aug/19 21:35
Worklog Time Spent: 10m 
  Work Description: boyuanzz commented on pull request #9144: [BEAM-7013] 
Integrating ZetaSketch's HLL++ algorithm with Beam
URL: https://github.com/apache/beam/pull/9144#discussion_r317809353
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/BigQueryHllSketchCompatibilityIT.java
 ##
 @@ -49,26 +57,74 @@
 @RunWith(JUnit4.class)
 public class BigQueryHllSketchCompatibilityIT {
 
-  private static final String DATASET_NAME = "zetasketch_compatibility_test";
+  private static final String APP_NAME;
+  private static final String PROJECT_ID;
+  private static final String DATASET_ID;
 
-  // Table for testReadSketchFromBigQuery()
+  private static final List TEST_DATA =
+  Arrays.asList("Apple", "Orange", "Banana", "Orange");
+
+  // Data Table: used by testReadSketchFromBigQuery())
   // Schema: only one STRING field named "data".
   // Content: prepopulated with 4 rows: "Apple", "Orange", "Banana", "Orange"
-  private static final String DATA_TABLE_NAME = "hll_data";
+  private static final String DATA_TABLE_ID = "hll_data";
   private static final String DATA_FIELD_NAME = "data";
+  private static final String DATA_FIELD_TYPE = "STRING";
   private static final String QUERY_RESULT_FIELD_NAME = "sketch";
   private static final Long EXPECTED_COUNT = 3L;
 
-  // Table for testWriteSketchToBigQuery()
+  // Sketch Table: used by testWriteSketchToBigQuery()
   // Schema: only one BYTES field named "sketch".
   // Content: will be overridden by the sketch computed by the test pipeline 
each time the test runs
-  private static final String SKETCH_TABLE_NAME = "hll_sketch";
+  private static final String SKETCH_TABLE_ID = "hll_sketch";
   private static final String SKETCH_FIELD_NAME = "sketch";
-  private static final List TEST_DATA =
-  Arrays.asList("Apple", "Orange", "Banana", "Orange");
+  private static final String SKETCH_FIELD_TYPE = "BYTES";
   // SHA-1 hash of string "[3]", the string representation of a row that has 
only one field 3 in it
   private static final String EXPECTED_CHECKSUM = 
"f1e31df9806ce94c5bdbbfff9608324930f4d3f1";
 
+  static {
+ApplicationNameOptions options =
+TestPipeline.testingPipelineOptions().as(ApplicationNameOptions.class);
+APP_NAME = options.getAppName();
+PROJECT_ID = options.as(GcpOptions.class).getProject();
+DATASET_ID = String.format("zetasketch_%tY_%
 
   > bq can't be final since this method is not a constructor.
   
   minor: I mean you can make `bq` a static attribute of your test class.
   
   >  It is fine though because BigqueryClient.getClient() does caching for us 
so it won't create another new client the second time we call it.
   
   minor: I don't think there is any caching logic in `BigqueryClient`. 
`BigqueryClient.getClient()` simply returns `new BigqueryClient()`.
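 
   A minimal sketch of the static-attribute suggestion, assuming a factory that 
does no caching. `FakeClient`, `setUpClient()` and `tearDownClient()` are 
hypothetical stand-ins used only for illustration; they are not the real 
`BigqueryClient` API or the test's actual methods.

```java
public class SharedClientSketch {

  /** Hypothetical stand-in for a client whose factory does no caching. */
  static final class FakeClient {
    private FakeClient() {}

    // Assumed behaviour, as described in the comment above: a fresh object on every call.
    static FakeClient getClient() {
      return new FakeClient();
    }
  }

  // Initialised once when the class is loaded, so it can be final
  // and is shared by every static method in the class.
  private static final FakeClient BQ = FakeClient.getClient();

  // Hypothetical setup hook: reuses the shared client.
  static FakeClient setUpClient() {
    return BQ;
  }

  // Hypothetical teardown hook: reuses the same shared client.
  static FakeClient tearDownClient() {
    return BQ;
  }

  public static void main(String[] args) {
    // Same instance in both hooks.
    System.out.println(setUpClient() == tearDownClient());
    // Two direct factory calls yield two distinct instances.
    System.out.println(FakeClient.getClient() == FakeClient.getClient());
  }
}
```

   With a static final field the client is created exactly once, whether or not 
the factory caches.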
 



Issue Time Tracking
---

Worklog Id: (was: 301528)
Time Spent: 28.5h  (was: 28h 20m)

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
>  Issue Type: New Feature
>  Components: extensions-java-sketching, sdk-java-core
>Reporter: Yueyang Qiu
>Assignee: Yueyang Qiu
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 28.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-08-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=301527=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-301527
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 26/Aug/19 21:30
Start Date: 26/Aug/19 21:30
Worklog Time Spent: 10m 
  Work Description: boyuanzz commented on pull request #9144: [BEAM-7013] 
Integrating ZetaSketch's HLL++ algorithm with Beam
URL: https://github.com/apache/beam/pull/9144#discussion_r317807506
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/BigQueryHllSketchCompatibilityIT.java
 ##
 @@ -49,26 +57,74 @@
 @RunWith(JUnit4.class)
 public class BigQueryHllSketchCompatibilityIT {
 
-  private static final String DATASET_NAME = "zetasketch_compatibility_test";
+  private static final String APP_NAME;
+  private static final String PROJECT_ID;
+  private static final String DATASET_ID;
 
-  // Table for testReadSketchFromBigQuery()
+  private static final List TEST_DATA =
+  Arrays.asList("Apple", "Orange", "Banana", "Orange");
+
+  // Data Table: used by testReadSketchFromBigQuery())
   // Schema: only one STRING field named "data".
   // Content: prepopulated with 4 rows: "Apple", "Orange", "Banana", "Orange"
-  private static final String DATA_TABLE_NAME = "hll_data";
+  private static final String DATA_TABLE_ID = "hll_data";
   private static final String DATA_FIELD_NAME = "data";
+  private static final String DATA_FIELD_TYPE = "STRING";
   private static final String QUERY_RESULT_FIELD_NAME = "sketch";
   private static final Long EXPECTED_COUNT = 3L;
 
-  // Table for testWriteSketchToBigQuery()
+  // Sketch Table: used by testWriteSketchToBigQuery()
   // Schema: only one BYTES field named "sketch".
   // Content: will be overridden by the sketch computed by the test pipeline 
each time the test runs
-  private static final String SKETCH_TABLE_NAME = "hll_sketch";
+  private static final String SKETCH_TABLE_ID = "hll_sketch";
   private static final String SKETCH_FIELD_NAME = "sketch";
-  private static final List TEST_DATA =
-  Arrays.asList("Apple", "Orange", "Banana", "Orange");
+  private static final String SKETCH_FIELD_TYPE = "BYTES";
   // SHA-1 hash of string "[3]", the string representation of a row that has 
only one field 3 in it
   private static final String EXPECTED_CHECKSUM = 
"f1e31df9806ce94c5bdbbfff9608324930f4d3f1";
 
+  static {
+ApplicationNameOptions options =
+TestPipeline.testingPipelineOptions().as(ApplicationNameOptions.class);
+APP_NAME = options.getAppName();
+PROJECT_ID = options.as(GcpOptions.class).getProject();
+DATASET_ID = String.format("zetasketch_%tY_%

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
>  Issue Type: New Feature
>  Components: extensions-java-sketching, sdk-java-core
>Reporter: Yueyang Qiu
>Assignee: Yueyang Qiu
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 28h 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-08-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=301524=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-301524
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 26/Aug/19 21:25
Start Date: 26/Aug/19 21:25
Worklog Time Spent: 10m 
  Work Description: robinyqiu commented on issue #9144: [BEAM-7013] 
Integrating ZetaSketch's HLL++ algorithm with Beam
URL: https://github.com/apache/beam/pull/9144#issuecomment-525039196
 
 
   Thank you all @boyuanzz @zfraa @amaliujia @reuvenlax for the review! I have 
squashed all the commits into one and I think this PR is ready to be merged 
once the precommit tests pass.
 



Issue Time Tracking
---

Worklog Id: (was: 301524)
Time Spent: 28h 10m  (was: 28h)

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
>  Issue Type: New Feature
>  Components: extensions-java-sketching, sdk-java-core
>Reporter: Yueyang Qiu
>Assignee: Yueyang Qiu
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 28h 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-08-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=301523=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-301523
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 26/Aug/19 21:22
Start Date: 26/Aug/19 21:22
Worklog Time Spent: 10m 
  Work Description: robinyqiu commented on pull request #9144: [BEAM-7013] 
Integrating ZetaSketch's HLL++ algorithm with Beam
URL: https://github.com/apache/beam/pull/9144#discussion_r317804729
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/BigQueryHllSketchCompatibilityIT.java
 ##
 @@ -49,26 +57,74 @@
 @RunWith(JUnit4.class)
 public class BigQueryHllSketchCompatibilityIT {
 
-  private static final String DATASET_NAME = "zetasketch_compatibility_test";
+  private static final String APP_NAME;
+  private static final String PROJECT_ID;
+  private static final String DATASET_ID;
 
-  // Table for testReadSketchFromBigQuery()
+  private static final List TEST_DATA =
+  Arrays.asList("Apple", "Orange", "Banana", "Orange");
+
+  // Data Table: used by testReadSketchFromBigQuery())
   // Schema: only one STRING field named "data".
   // Content: prepopulated with 4 rows: "Apple", "Orange", "Banana", "Orange"
-  private static final String DATA_TABLE_NAME = "hll_data";
+  private static final String DATA_TABLE_ID = "hll_data";
   private static final String DATA_FIELD_NAME = "data";
+  private static final String DATA_FIELD_TYPE = "STRING";
   private static final String QUERY_RESULT_FIELD_NAME = "sketch";
   private static final Long EXPECTED_COUNT = 3L;
 
-  // Table for testWriteSketchToBigQuery()
+  // Sketch Table: used by testWriteSketchToBigQuery()
   // Schema: only one BYTES field named "sketch".
   // Content: will be overridden by the sketch computed by the test pipeline 
each time the test runs
-  private static final String SKETCH_TABLE_NAME = "hll_sketch";
+  private static final String SKETCH_TABLE_ID = "hll_sketch";
   private static final String SKETCH_FIELD_NAME = "sketch";
-  private static final List TEST_DATA =
-  Arrays.asList("Apple", "Orange", "Banana", "Orange");
+  private static final String SKETCH_FIELD_TYPE = "BYTES";
   // SHA-1 hash of string "[3]", the string representation of a row that has 
only one field 3 in it
   private static final String EXPECTED_CHECKSUM = 
"f1e31df9806ce94c5bdbbfff9608324930f4d3f1";
 
+  static {
+ApplicationNameOptions options =
+TestPipeline.testingPipelineOptions().as(ApplicationNameOptions.class);
+APP_NAME = options.getAppName();
+PROJECT_ID = options.as(GcpOptions.class).getProject();
+DATASET_ID = String.format("zetasketch_%tY_%

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
>  Issue Type: New Feature
>  Components: extensions-java-sketching, sdk-java-core
>Reporter: Yueyang Qiu
>Assignee: Yueyang Qiu
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 28h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-08-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=301522=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-301522
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 26/Aug/19 21:22
Start Date: 26/Aug/19 21:22
Worklog Time Spent: 10m 
  Work Description: robinyqiu commented on pull request #9144: [BEAM-7013] 
Integrating ZetaSketch's HLL++ algorithm with Beam
URL: https://github.com/apache/beam/pull/9144#discussion_r317804549
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/BigQueryHllSketchCompatibilityIT.java
 ##
 @@ -49,26 +57,74 @@
 @RunWith(JUnit4.class)
 public class BigQueryHllSketchCompatibilityIT {
 
-  private static final String DATASET_NAME = "zetasketch_compatibility_test";
+  private static final String APP_NAME;
+  private static final String PROJECT_ID;
+  private static final String DATASET_ID;
 
-  // Table for testReadSketchFromBigQuery()
+  private static final List TEST_DATA =
+  Arrays.asList("Apple", "Orange", "Banana", "Orange");
+
+  // Data Table: used by testReadSketchFromBigQuery())
   // Schema: only one STRING field named "data".
   // Content: prepopulated with 4 rows: "Apple", "Orange", "Banana", "Orange"
-  private static final String DATA_TABLE_NAME = "hll_data";
+  private static final String DATA_TABLE_ID = "hll_data";
   private static final String DATA_FIELD_NAME = "data";
+  private static final String DATA_FIELD_TYPE = "STRING";
   private static final String QUERY_RESULT_FIELD_NAME = "sketch";
   private static final Long EXPECTED_COUNT = 3L;
 
-  // Table for testWriteSketchToBigQuery()
+  // Sketch Table: used by testWriteSketchToBigQuery()
   // Schema: only one BYTES field named "sketch".
   // Content: will be overridden by the sketch computed by the test pipeline 
each time the test runs
-  private static final String SKETCH_TABLE_NAME = "hll_sketch";
+  private static final String SKETCH_TABLE_ID = "hll_sketch";
   private static final String SKETCH_FIELD_NAME = "sketch";
-  private static final List TEST_DATA =
-  Arrays.asList("Apple", "Orange", "Banana", "Orange");
+  private static final String SKETCH_FIELD_TYPE = "BYTES";
   // SHA-1 hash of string "[3]", the string representation of a row that has 
only one field 3 in it
   private static final String EXPECTED_CHECKSUM = 
"f1e31df9806ce94c5bdbbfff9608324930f4d3f1";
 
+  static {
+ApplicationNameOptions options =
+TestPipeline.testingPipelineOptions().as(ApplicationNameOptions.class);
+APP_NAME = options.getAppName();
+PROJECT_ID = options.as(GcpOptions.class).getProject();
+DATASET_ID = String.format("zetasketch_%tY_%

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
>  Issue Type: New Feature
>  Components: extensions-java-sketching, sdk-java-core
>Reporter: Yueyang Qiu
>Assignee: Yueyang Qiu
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 27h 50m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-08-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=301406=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-301406
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 26/Aug/19 18:24
Start Date: 26/Aug/19 18:24
Worklog Time Spent: 10m 
  Work Description: boyuanzz commented on pull request #9144: [BEAM-7013] 
Integrating ZetaSketch's HLL++ algorithm with Beam
URL: https://github.com/apache/beam/pull/9144#discussion_r317731169
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/BigQueryHllSketchCompatibilityIT.java
 ##
 @@ -49,26 +57,74 @@
 @RunWith(JUnit4.class)
 public class BigQueryHllSketchCompatibilityIT {
 
-  private static final String DATASET_NAME = "zetasketch_compatibility_test";
+  private static final String APP_NAME;
+  private static final String PROJECT_ID;
+  private static final String DATASET_ID;
 
-  // Table for testReadSketchFromBigQuery()
+  private static final List TEST_DATA =
+  Arrays.asList("Apple", "Orange", "Banana", "Orange");
+
+  // Data Table: used by testReadSketchFromBigQuery())
   // Schema: only one STRING field named "data".
   // Content: prepopulated with 4 rows: "Apple", "Orange", "Banana", "Orange"
-  private static final String DATA_TABLE_NAME = "hll_data";
+  private static final String DATA_TABLE_ID = "hll_data";
   private static final String DATA_FIELD_NAME = "data";
+  private static final String DATA_FIELD_TYPE = "STRING";
   private static final String QUERY_RESULT_FIELD_NAME = "sketch";
   private static final Long EXPECTED_COUNT = 3L;
 
-  // Table for testWriteSketchToBigQuery()
+  // Sketch Table: used by testWriteSketchToBigQuery()
   // Schema: only one BYTES field named "sketch".
   // Content: will be overridden by the sketch computed by the test pipeline 
each time the test runs
-  private static final String SKETCH_TABLE_NAME = "hll_sketch";
+  private static final String SKETCH_TABLE_ID = "hll_sketch";
   private static final String SKETCH_FIELD_NAME = "sketch";
-  private static final List TEST_DATA =
-  Arrays.asList("Apple", "Orange", "Banana", "Orange");
+  private static final String SKETCH_FIELD_TYPE = "BYTES";
   // SHA-1 hash of string "[3]", the string representation of a row that has 
only one field 3 in it
   private static final String EXPECTED_CHECKSUM = 
"f1e31df9806ce94c5bdbbfff9608324930f4d3f1";
 
+  static {
+ApplicationNameOptions options =
+TestPipeline.testingPipelineOptions().as(ApplicationNameOptions.class);
+APP_NAME = options.getAppName();
+PROJECT_ID = options.as(GcpOptions.class).getProject();
+DATASET_ID = String.format("zetasketch_%tY_%

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
>  Issue Type: New Feature
>  Components: extensions-java-sketching, sdk-java-core
>Reporter: Yueyang Qiu
>Assignee: Yueyang Qiu
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 27h 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-08-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=301405=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-301405
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 26/Aug/19 18:24
Start Date: 26/Aug/19 18:24
Worklog Time Spent: 10m 
  Work Description: boyuanzz commented on pull request #9144: [BEAM-7013] 
Integrating ZetaSketch's HLL++ algorithm with Beam
URL: https://github.com/apache/beam/pull/9144#discussion_r317730080
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/BigQueryHllSketchCompatibilityIT.java
 ##
 @@ -49,26 +57,74 @@
 @RunWith(JUnit4.class)
 public class BigQueryHllSketchCompatibilityIT {
 
-  private static final String DATASET_NAME = "zetasketch_compatibility_test";
+  private static final String APP_NAME;
+  private static final String PROJECT_ID;
+  private static final String DATASET_ID;
 
-  // Table for testReadSketchFromBigQuery()
+  private static final List TEST_DATA =
+  Arrays.asList("Apple", "Orange", "Banana", "Orange");
+
+  // Data Table: used by testReadSketchFromBigQuery())
   // Schema: only one STRING field named "data".
   // Content: prepopulated with 4 rows: "Apple", "Orange", "Banana", "Orange"
-  private static final String DATA_TABLE_NAME = "hll_data";
+  private static final String DATA_TABLE_ID = "hll_data";
   private static final String DATA_FIELD_NAME = "data";
+  private static final String DATA_FIELD_TYPE = "STRING";
   private static final String QUERY_RESULT_FIELD_NAME = "sketch";
   private static final Long EXPECTED_COUNT = 3L;
 
-  // Table for testWriteSketchToBigQuery()
+  // Sketch Table: used by testWriteSketchToBigQuery()
   // Schema: only one BYTES field named "sketch".
   // Content: will be overridden by the sketch computed by the test pipeline 
each time the test runs
-  private static final String SKETCH_TABLE_NAME = "hll_sketch";
+  private static final String SKETCH_TABLE_ID = "hll_sketch";
   private static final String SKETCH_FIELD_NAME = "sketch";
-  private static final List TEST_DATA =
-  Arrays.asList("Apple", "Orange", "Banana", "Orange");
+  private static final String SKETCH_FIELD_TYPE = "BYTES";
   // SHA-1 hash of string "[3]", the string representation of a row that has 
only one field 3 in it
   private static final String EXPECTED_CHECKSUM = 
"f1e31df9806ce94c5bdbbfff9608324930f4d3f1";
 
+  static {
+ApplicationNameOptions options =
+TestPipeline.testingPipelineOptions().as(ApplicationNameOptions.class);
+APP_NAME = options.getAppName();
+PROJECT_ID = options.as(GcpOptions.class).getProject();
+DATASET_ID = String.format("zetasketch_%tY_%

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
>  Issue Type: New Feature
>  Components: extensions-java-sketching, sdk-java-core
>Reporter: Yueyang Qiu
>Assignee: Yueyang Qiu
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 27.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-08-26 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=301403=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-301403
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 26/Aug/19 18:21
Start Date: 26/Aug/19 18:21
Worklog Time Spent: 10m 
  Work Description: robinyqiu commented on issue #9144: [BEAM-7013] 
Integrating ZetaSketch's HLL++ algorithm with Beam
URL: https://github.com/apache/beam/pull/9144#issuecomment-524971095
 
 
   Run Java PostCommit
 



Issue Time Tracking
---

Worklog Id: (was: 301403)
Time Spent: 27h 20m  (was: 27h 10m)

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
>  Issue Type: New Feature
>  Components: extensions-java-sketching, sdk-java-core
>Reporter: Yueyang Qiu
>Assignee: Yueyang Qiu
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 27h 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-08-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=300067=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-300067
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 23/Aug/19 06:09
Start Date: 23/Aug/19 06:09
Worklog Time Spent: 10m 
  Work Description: robinyqiu commented on issue #9144: [BEAM-7013] 
Integrating ZetaSketch's HLL++ algorithm with Beam
URL: https://github.com/apache/beam/pull/9144#issuecomment-524185713
 
 
   Run Java PostCommit
 



Issue Time Tracking
---

Worklog Id: (was: 300067)
Time Spent: 27h 10m  (was: 27h)

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
>  Issue Type: New Feature
>  Components: extensions-java-sketching, sdk-java-core
>Reporter: Yueyang Qiu
>Assignee: Yueyang Qiu
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 27h 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-08-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=299822=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-299822
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 22/Aug/19 23:33
Start Date: 22/Aug/19 23:33
Worklog Time Spent: 10m 
  Work Description: robinyqiu commented on issue #9144: [BEAM-7013] 
Integrating ZetaSketch's HLL++ algorithm with Beam
URL: https://github.com/apache/beam/pull/9144#issuecomment-524116749
 
 
   Run Java PostCommit
 



Issue Time Tracking
---

Worklog Id: (was: 299822)
Time Spent: 27h  (was: 26h 50m)

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
>  Issue Type: New Feature
>  Components: extensions-java-sketching, sdk-java-core
>Reporter: Yueyang Qiu
>Assignee: Yueyang Qiu
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 27h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-08-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=299820=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-299820
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 22/Aug/19 23:28
Start Date: 22/Aug/19 23:28
Worklog Time Spent: 10m 
  Work Description: robinyqiu commented on issue #9144: [BEAM-7013] 
Integrating ZetaSketch's HLL++ algorithm with Beam
URL: https://github.com/apache/beam/pull/9144#issuecomment-524115723
 
 
   I have made the change so that the BQ tables needed for testing are now 
created before the tests and deleted after them. PTAL.
 



Issue Time Tracking
---

Worklog Id: (was: 299820)
Time Spent: 26h 50m  (was: 26h 40m)

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
>  Issue Type: New Feature
>  Components: extensions-java-sketching, sdk-java-core
>Reporter: Yueyang Qiu
>Assignee: Yueyang Qiu
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 26h 50m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-08-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=299819=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-299819
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 22/Aug/19 23:27
Start Date: 22/Aug/19 23:27
Worklog Time Spent: 10m 
  Work Description: robinyqiu commented on pull request #9144: [BEAM-7013] 
Integrating ZetaSketch's HLL++ algorithm with Beam
URL: https://github.com/apache/beam/pull/9144#discussion_r316924297
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/BigQueryHllSketchCompatibilityIT.java
 ##
 @@ -0,0 +1,145 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.beam.sdk.extensions.zetasketch;
+
+import com.google.api.services.bigquery.model.TableFieldSchema;
+import com.google.api.services.bigquery.model.TableRow;
+import com.google.api.services.bigquery.model.TableSchema;
+import java.nio.ByteBuffer;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+import org.apache.beam.sdk.Pipeline;
+import org.apache.beam.sdk.coders.ByteArrayCoder;
+import org.apache.beam.sdk.extensions.gcp.options.GcpOptions;
+import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
+import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.TypedRead.Method;
+import org.apache.beam.sdk.io.gcp.bigquery.SchemaAndRecord;
+import org.apache.beam.sdk.io.gcp.testing.BigqueryMatcher;
+import org.apache.beam.sdk.options.ApplicationNameOptions;
+import org.apache.beam.sdk.testing.PAssert;
+import org.apache.beam.sdk.testing.TestPipeline;
+import org.apache.beam.sdk.testing.TestPipelineOptions;
+import org.apache.beam.sdk.transforms.Create;
+import org.apache.beam.sdk.transforms.SerializableFunction;
+import org.apache.beam.sdk.values.PCollection;
+import org.junit.Test;
+import org.junit.runner.RunWith;
+import org.junit.runners.JUnit4;
+
+/**
+ * Integration tests for HLL++ sketch compatibility between Beam and BigQuery. 
The tests verifies
+ * that HLL++ sketches created in Beam can be processed by BigQuery, and vice 
versa.
+ */
+@RunWith(JUnit4.class)
+public class BigQueryHllSketchCompatibilityIT {
+
+  private static final String DATASET_NAME = "zetasketch_compatibility_test";
+
+  // Table for testReadSketchFromBigQuery()
+  // Schema: only one STRING field named "data".
+  // Content: prepopulated with 4 rows: "Apple", "Orange", "Banana", "Orange"
+  private static final String DATA_TABLE_NAME = "hll_data";
+  private static final String DATA_FIELD_NAME = "data";
+  private static final String QUERY_RESULT_FIELD_NAME = "sketch";
+  private static final Long EXPECTED_COUNT = 3L;
+
+  // Table for testWriteSketchToBigQuery()
+  // Schema: only one BYTES field named "sketch".
+  // Content: will be overridden by the sketch computed by the test pipeline 
each time the test runs
+  private static final String SKETCH_TABLE_NAME = "hll_sketch";
+  private static final String SKETCH_FIELD_NAME = "sketch";
+  private static final List TEST_DATA =
+  Arrays.asList("Apple", "Orange", "Banana", "Orange");
+  // SHA-1 hash of string "[3]", the string representation of a row that has 
only one field 3 in it
+  private static final String EXPECTED_CHECKSUM = 
"f1e31df9806ce94c5bdbbfff9608324930f4d3f1";
+
+  /**
+   * Test that HLL++ sketch computed in BigQuery can be processed by Beam. Hll 
sketch is computed by
+   * {@code HLL_COUNT.INIT} in BigQuery and read into Beam; the test verifies 
that we can run {@link
+   * HllCount.MergePartial} and {@link HllCount.Extract} on the sketch in Beam 
to get the correct
+   * estimated count.
+   */
+  @Test
+  public void testReadSketchFromBigQuery() {
 
 Review comment:
   Done.
 



Issue Time Tracking
---


[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-08-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=299818&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-299818
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 22/Aug/19 23:27
Start Date: 22/Aug/19 23:27
Worklog Time Spent: 10m 
  Work Description: robinyqiu commented on pull request #9144: [BEAM-7013] 
Integrating ZetaSketch's HLL++ algorithm with Beam
URL: https://github.com/apache/beam/pull/9144#discussion_r316924234
 
 

 ##
 File path: sdks/java/extensions/zetasketch/build.gradle
 ##
 @@ -0,0 +1,64 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * License); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+import groovy.json.JsonOutput
+
+plugins { id 'org.apache.beam.module' }
+applyJavaNature()
+
+description = "Apache Beam :: SDKs :: Java :: Extensions :: ZetaSketch"
+
+def zetasketch_version = "0.1.0"
+
+dependencies {
+compile library.java.vendored_guava_26_0_jre
+compile project(path: ":sdks:java:core", configuration: "shadow")
+compile "com.google.zetasketch:zetasketch:$zetasketch_version"
+testCompile library.java.junit
+testCompile project(":sdks:java:io:google-cloud-platform")
+testRuntimeOnly project(":runners:direct-java")
+testRuntimeOnly project(":runners:google-cloud-dataflow-java")
+}
+
+/**
+ * Integration tests running on Dataflow with BigQuery.
+ */
+task integrationTest(type: Test) {
+group = "Verification"
+def gcpProject = project.findProperty('gcpProject') ?: 'apache-beam-testing'
+def gcpTempRoot = project.findProperty('gcpTempRoot') ?: 'gs://temp-storage-for-end-to-end-tests'
+systemProperty "beamTestPipelineOptions", JsonOutput.toJson([
+"--runner=TestDataflowRunner",
+"--project=${gcpProject}",
+"--tempRoot=${gcpTempRoot}",
+])
 
 Review comment:
   Done.
 



Issue Time Tracking
---

Worklog Id: (was: 299818)
Time Spent: 26.5h  (was: 26h 20m)

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
>  Issue Type: New Feature
>  Components: extensions-java-sketching, sdk-java-core
>Reporter: Yueyang Qiu
>Assignee: Yueyang Qiu
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 26.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-08-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=299083&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-299083
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 21/Aug/19 23:57
Start Date: 21/Aug/19 23:57
Worklog Time Spent: 10m 
  Work Description: boyuanzz commented on pull request #9144: [BEAM-7013] 
Integrating ZetaSketch's HLL++ algorithm with Beam
URL: https://github.com/apache/beam/pull/9144#discussion_r316449066
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/BigQueryHllSketchCompatibilityIT.java
 ##
 @@ -0,0 +1,145 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.beam.sdk.extensions.zetasketch;
+
+import com.google.api.services.bigquery.model.TableFieldSchema;
+import com.google.api.services.bigquery.model.TableRow;
+import com.google.api.services.bigquery.model.TableSchema;
+import java.nio.ByteBuffer;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+import org.apache.beam.sdk.Pipeline;
+import org.apache.beam.sdk.coders.ByteArrayCoder;
+import org.apache.beam.sdk.extensions.gcp.options.GcpOptions;
+import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
+import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.TypedRead.Method;
+import org.apache.beam.sdk.io.gcp.bigquery.SchemaAndRecord;
+import org.apache.beam.sdk.io.gcp.testing.BigqueryMatcher;
+import org.apache.beam.sdk.options.ApplicationNameOptions;
+import org.apache.beam.sdk.testing.PAssert;
+import org.apache.beam.sdk.testing.TestPipeline;
+import org.apache.beam.sdk.testing.TestPipelineOptions;
+import org.apache.beam.sdk.transforms.Create;
+import org.apache.beam.sdk.transforms.SerializableFunction;
+import org.apache.beam.sdk.values.PCollection;
+import org.junit.Test;
+import org.junit.runner.RunWith;
+import org.junit.runners.JUnit4;
+
+/**
+ * Integration tests for HLL++ sketch compatibility between Beam and BigQuery. The tests verify
+ * that HLL++ sketches created in Beam can be processed by BigQuery, and vice versa.
+ */
+@RunWith(JUnit4.class)
+public class BigQueryHllSketchCompatibilityIT {
+
+  private static final String DATASET_NAME = "zetasketch_compatibility_test";
+
+  // Table for testReadSketchFromBigQuery()
+  // Schema: only one STRING field named "data".
+  // Content: prepopulated with 4 rows: "Apple", "Orange", "Banana", "Orange"
+  private static final String DATA_TABLE_NAME = "hll_data";
+  private static final String DATA_FIELD_NAME = "data";
+  private static final String QUERY_RESULT_FIELD_NAME = "sketch";
+  private static final Long EXPECTED_COUNT = 3L;
+
+  // Table for testWriteSketchToBigQuery()
+  // Schema: only one BYTES field named "sketch".
+  // Content: will be overridden by the sketch computed by the test pipeline each time the test runs
+  private static final String SKETCH_TABLE_NAME = "hll_sketch";
+  private static final String SKETCH_FIELD_NAME = "sketch";
+  private static final List<String> TEST_DATA =
+      Arrays.asList("Apple", "Orange", "Banana", "Orange");
+  // SHA-1 hash of string "[3]", the string representation of a row that has only one field 3 in it
+  private static final String EXPECTED_CHECKSUM = "f1e31df9806ce94c5bdbbfff9608324930f4d3f1";
+
+  /**
+   * Test that HLL++ sketch computed in BigQuery can be processed by Beam. Hll sketch is computed by
+   * {@code HLL_COUNT.INIT} in BigQuery and read into Beam; the test verifies that we can run {@link
+   * HllCount.MergePartial} and {@link HllCount.Extract} on the sketch in Beam to get the correct
+   * estimated count.
+   */
+  @Test
+  public void testReadSketchFromBigQuery() {
 
 Review comment:
   
https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/testing/BigqueryClient.java
 is a utility class that BigQuery ITs can use to create and clean up test data.
 


[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-08-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=299082&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-299082
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 21/Aug/19 23:52
Start Date: 21/Aug/19 23:52
Worklog Time Spent: 10m 
  Work Description: robinyqiu commented on pull request #9144: [BEAM-7013] 
Integrating ZetaSketch's HLL++ algorithm with Beam
URL: https://github.com/apache/beam/pull/9144#discussion_r316447983
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/BigQueryHllSketchCompatibilityIT.java
 ##
 @@ -0,0 +1,145 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.beam.sdk.extensions.zetasketch;
+
+import com.google.api.services.bigquery.model.TableFieldSchema;
+import com.google.api.services.bigquery.model.TableRow;
+import com.google.api.services.bigquery.model.TableSchema;
+import java.nio.ByteBuffer;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+import org.apache.beam.sdk.Pipeline;
+import org.apache.beam.sdk.coders.ByteArrayCoder;
+import org.apache.beam.sdk.extensions.gcp.options.GcpOptions;
+import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
+import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.TypedRead.Method;
+import org.apache.beam.sdk.io.gcp.bigquery.SchemaAndRecord;
+import org.apache.beam.sdk.io.gcp.testing.BigqueryMatcher;
+import org.apache.beam.sdk.options.ApplicationNameOptions;
+import org.apache.beam.sdk.testing.PAssert;
+import org.apache.beam.sdk.testing.TestPipeline;
+import org.apache.beam.sdk.testing.TestPipelineOptions;
+import org.apache.beam.sdk.transforms.Create;
+import org.apache.beam.sdk.transforms.SerializableFunction;
+import org.apache.beam.sdk.values.PCollection;
+import org.junit.Test;
+import org.junit.runner.RunWith;
+import org.junit.runners.JUnit4;
+
+/**
+ * Integration tests for HLL++ sketch compatibility between Beam and BigQuery. The tests verify
+ * that HLL++ sketches created in Beam can be processed by BigQuery, and vice versa.
+ */
+@RunWith(JUnit4.class)
+public class BigQueryHllSketchCompatibilityIT {
+
+  private static final String DATASET_NAME = "zetasketch_compatibility_test";
+
+  // Table for testReadSketchFromBigQuery()
+  // Schema: only one STRING field named "data".
+  // Content: prepopulated with 4 rows: "Apple", "Orange", "Banana", "Orange"
+  private static final String DATA_TABLE_NAME = "hll_data";
+  private static final String DATA_FIELD_NAME = "data";
+  private static final String QUERY_RESULT_FIELD_NAME = "sketch";
+  private static final Long EXPECTED_COUNT = 3L;
+
+  // Table for testWriteSketchToBigQuery()
+  // Schema: only one BYTES field named "sketch".
+  // Content: will be overridden by the sketch computed by the test pipeline each time the test runs
+  private static final String SKETCH_TABLE_NAME = "hll_sketch";
+  private static final String SKETCH_FIELD_NAME = "sketch";
+  private static final List<String> TEST_DATA =
+      Arrays.asList("Apple", "Orange", "Banana", "Orange");
+  // SHA-1 hash of string "[3]", the string representation of a row that has only one field 3 in it
+  private static final String EXPECTED_CHECKSUM = "f1e31df9806ce94c5bdbbfff9608324930f4d3f1";
+
+  /**
+   * Test that HLL++ sketch computed in BigQuery can be processed by Beam. Hll sketch is computed by
+   * {@code HLL_COUNT.INIT} in BigQuery and read into Beam; the test verifies that we can run {@link
+   * HllCount.MergePartial} and {@link HllCount.Extract} on the sketch in Beam to get the correct
+   * estimated count.
+   */
+  @Test
+  public void testReadSketchFromBigQuery() {
 
 Review comment:
   This is a very good point. I agree with you that ideally a test should be 
self-contained and not depend on any external resources. The reason why I did 
that is simply because the other BigQueryIO integration tests under this 

[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-08-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=299074&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-299074
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 21/Aug/19 23:43
Start Date: 21/Aug/19 23:43
Worklog Time Spent: 10m 
  Work Description: robinyqiu commented on pull request #9144: [BEAM-7013] 
Integrating ZetaSketch's HLL++ algorithm with Beam
URL: https://github.com/apache/beam/pull/9144#discussion_r316446092
 
 

 ##
 File path: sdks/java/extensions/zetasketch/build.gradle
 ##
 @@ -0,0 +1,64 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * License); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+import groovy.json.JsonOutput
+
+plugins { id 'org.apache.beam.module' }
+applyJavaNature()
+
+description = "Apache Beam :: SDKs :: Java :: Extensions :: ZetaSketch"
+
+def zetasketch_version = "0.1.0"
+
+dependencies {
+compile library.java.vendored_guava_26_0_jre
+compile project(path: ":sdks:java:core", configuration: "shadow")
+compile "com.google.zetasketch:zetasketch:$zetasketch_version"
+testCompile library.java.junit
+testCompile project(":sdks:java:io:google-cloud-platform")
+testRuntimeOnly project(":runners:direct-java")
+testRuntimeOnly project(":runners:google-cloud-dataflow-java")
+}
+
+/**
+ * Integration tests running on Dataflow with BigQuery.
+ */
+task integrationTest(type: Test) {
+group = "Verification"
+def gcpProject = project.findProperty('gcpProject') ?: 'apache-beam-testing'
+def gcpTempRoot = project.findProperty('gcpTempRoot') ?: 'gs://temp-storage-for-end-to-end-tests'
+systemProperty "beamTestPipelineOptions", JsonOutput.toJson([
+"--runner=TestDataflowRunner",
+"--project=${gcpProject}",
+"--tempRoot=${gcpTempRoot}",
+])
 
 Review comment:
   Thanks for the pointer. Will do.
 



Issue Time Tracking
---

Worklog Id: (was: 299074)
Time Spent: 26h  (was: 25h 50m)

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
>  Issue Type: New Feature
>  Components: extensions-java-sketching, sdk-java-core
>Reporter: Yueyang Qiu
>Assignee: Yueyang Qiu
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 26h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-08-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=299073&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-299073
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 21/Aug/19 23:33
Start Date: 21/Aug/19 23:33
Worklog Time Spent: 10m 
  Work Description: robinyqiu commented on pull request #9144: [BEAM-7013] 
Integrating ZetaSketch's HLL++ algorithm with Beam
URL: https://github.com/apache/beam/pull/9144#discussion_r316444192
 
 

 ##
 File path: 
sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/testing/BigqueryMatcher.java
 ##
 @@ -63,32 +63,52 @@
 
   private final String projectId;
   private final String query;
+  private final boolean usingStandardSql;
 
 Review comment:
   I have considered this. But in our Beam `BigQueryIO` source we also have
   
https://github.com/apache/beam/blob/08d0146791e38be4641ff80ffb2539cdc81f5b6d/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L600
   
   Since this is a public API visible to Beam users but the BQ client is not, I 
decided to be consistent with the former (and therefore I have to negate the 
boolean somewhere in the function call stack).
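
   The negation described in this comment can be sketched as follows. This is a
   minimal, self-contained illustration and not the actual `BigqueryMatcher`
   code: the class `QueryDialect` and its methods are hypothetical. The only
   assumption taken from the thread is that the user-facing flag is phrased as
   `usingStandardSql`, while the BigQuery client model expects a legacy-SQL flag.

```java
// Hypothetical sketch; QueryDialect is an illustrative name, not a Beam class.
final class QueryDialect {
  // User-facing flag, phrased the same way as the usingStandardSql option discussed above.
  private final boolean usingStandardSql;

  private QueryDialect(boolean usingStandardSql) {
    this.usingStandardSql = usingStandardSql;
  }

  // Legacy SQL is the default in this sketch.
  static QueryDialect legacySql() {
    return new QueryDialect(false);
  }

  // Returns a copy that opts in to standard SQL.
  QueryDialect usingStandardSql() {
    return new QueryDialect(true);
  }

  // The single place where the flag is negated before it is handed to the client.
  boolean useLegacySqlForClient() {
    return !usingStandardSql;
  }

  public static void main(String[] args) {
    System.out.println(QueryDialect.legacySql().useLegacySqlForClient());
    System.out.println(QueryDialect.legacySql().usingStandardSql().useLegacySqlForClient());
  }
}
```

   Keeping the negation in one method means callers only ever see the
   user-facing spelling of the flag.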
 



Issue Time Tracking
---

Worklog Id: (was: 299073)
Time Spent: 25h 50m  (was: 25h 40m)

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
>  Issue Type: New Feature
>  Components: extensions-java-sketching, sdk-java-core
>Reporter: Yueyang Qiu
>Assignee: Yueyang Qiu
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 25h 50m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-08-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=299003&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-299003
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 21/Aug/19 20:50
Start Date: 21/Aug/19 20:50
Worklog Time Spent: 10m 
  Work Description: boyuanzz commented on pull request #9144: [BEAM-7013] 
Integrating ZetaSketch's HLL++ algorithm with Beam
URL: https://github.com/apache/beam/pull/9144#discussion_r316396289
 
 

 ##
 File path: 
sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/testing/BigqueryMatcher.java
 ##
 @@ -63,32 +63,52 @@
 
   private final String projectId;
   private final String query;
+  private final boolean usingStandardSql;
 
 Review comment:
   Thanks for the pointer! Then maybe we should use `usingLegacySql` instead 
of `usingStandardSql` to stay consistent with the BigQuery model. What do you think?
 



Issue Time Tracking
---

Worklog Id: (was: 299003)
Time Spent: 25h 40m  (was: 25.5h)

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
>  Issue Type: New Feature
>  Components: extensions-java-sketching, sdk-java-core
>Reporter: Yueyang Qiu
>Assignee: Yueyang Qiu
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 25h 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-08-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=299000&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-299000
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 21/Aug/19 20:46
Start Date: 21/Aug/19 20:46
Worklog Time Spent: 10m 
  Work Description: boyuanzz commented on pull request #9144: [BEAM-7013] 
Integrating ZetaSketch's HLL++ algorithm with Beam
URL: https://github.com/apache/beam/pull/9144#discussion_r316394576
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/BigQueryHllSketchCompatibilityIT.java
 ##
 @@ -0,0 +1,145 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.beam.sdk.extensions.zetasketch;
+
+import com.google.api.services.bigquery.model.TableFieldSchema;
+import com.google.api.services.bigquery.model.TableRow;
+import com.google.api.services.bigquery.model.TableSchema;
+import java.nio.ByteBuffer;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+import org.apache.beam.sdk.Pipeline;
+import org.apache.beam.sdk.coders.ByteArrayCoder;
+import org.apache.beam.sdk.extensions.gcp.options.GcpOptions;
+import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
+import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.TypedRead.Method;
+import org.apache.beam.sdk.io.gcp.bigquery.SchemaAndRecord;
+import org.apache.beam.sdk.io.gcp.testing.BigqueryMatcher;
+import org.apache.beam.sdk.options.ApplicationNameOptions;
+import org.apache.beam.sdk.testing.PAssert;
+import org.apache.beam.sdk.testing.TestPipeline;
+import org.apache.beam.sdk.testing.TestPipelineOptions;
+import org.apache.beam.sdk.transforms.Create;
+import org.apache.beam.sdk.transforms.SerializableFunction;
+import org.apache.beam.sdk.values.PCollection;
+import org.junit.Test;
+import org.junit.runner.RunWith;
+import org.junit.runners.JUnit4;
+
+/**
+ * Integration tests for HLL++ sketch compatibility between Beam and BigQuery. The tests verify
+ * that HLL++ sketches created in Beam can be processed by BigQuery, and vice versa.
+ */
+@RunWith(JUnit4.class)
+public class BigQueryHllSketchCompatibilityIT {
+
+  private static final String DATASET_NAME = "zetasketch_compatibility_test";
+
+  // Table for testReadSketchFromBigQuery()
+  // Schema: only one STRING field named "data".
+  // Content: prepopulated with 4 rows: "Apple", "Orange", "Banana", "Orange"
+  private static final String DATA_TABLE_NAME = "hll_data";
+  private static final String DATA_FIELD_NAME = "data";
+  private static final String QUERY_RESULT_FIELD_NAME = "sketch";
+  private static final Long EXPECTED_COUNT = 3L;
+
+  // Table for testWriteSketchToBigQuery()
+  // Schema: only one BYTES field named "sketch".
+  // Content: will be overridden by the sketch computed by the test pipeline each time the test runs
+  private static final String SKETCH_TABLE_NAME = "hll_sketch";
+  private static final String SKETCH_FIELD_NAME = "sketch";
+  private static final List<String> TEST_DATA =
+      Arrays.asList("Apple", "Orange", "Banana", "Orange");
+  // SHA-1 hash of string "[3]", the string representation of a row that has only one field 3 in it
+  private static final String EXPECTED_CHECKSUM = "f1e31df9806ce94c5bdbbfff9608324930f4d3f1";
+
+  /**
+   * Test that HLL++ sketch computed in BigQuery can be processed by Beam. Hll sketch is computed by
+   * {@code HLL_COUNT.INIT} in BigQuery and read into Beam; the test verifies that we can run {@link
+   * HllCount.MergePartial} and {@link HllCount.Extract} on the sketch in Beam to get the correct
+   * estimated count.
+   */
+  @Test
+  public void testReadSketchFromBigQuery() {
 
 Review comment:
   Is there a reason you chose to create the test data manually? IMO, it makes 
operations harder in certain scenarios. For example, if our infra team decides 
to use a different project to run all ITs, your test will break. Instead, how 
about creating your test data in `@BeforeClass` and deleting all of it in 
`@AfterClass`?
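
   The setup and teardown suggested in this comment can be sketched without any
   Beam or BigQuery dependency. Everything here is hypothetical:
   `FakeTableStore` is an in-memory stand-in for a BigQuery client, and in a
   real JUnit 4 test `prepareTestData` and `cleanUpTestData` would be static
   methods annotated with `@BeforeClass` and `@AfterClass`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a self-contained integration test lifecycle.
final class SelfContainedItSketch {
  static final FakeTableStore STORE = new FakeTableStore();
  // A per-run suffix keeps concurrent runs from sharing a table (an assumption, not from the PR).
  static final String DATA_TABLE = "hll_data_" + System.nanoTime();

  // In JUnit 4 this would carry @BeforeClass: create the data the tests read.
  static void prepareTestData() {
    STORE.createTable(DATA_TABLE, Arrays.asList("Apple", "Orange", "Banana", "Orange"));
  }

  // In JUnit 4 this would carry @AfterClass: remove everything the setup created.
  static void cleanUpTestData() {
    STORE.deleteTable(DATA_TABLE);
  }

  public static void main(String[] args) {
    prepareTestData();
    try {
      // Exact distinct count of the four rows; stands in for the HLL++ estimate.
      System.out.println(STORE.countDistinct(DATA_TABLE));
    } finally {
      cleanUpTestData();
    }
  }
}

// In-memory stand-in for a BigQuery client; not a Beam or Google API.
final class FakeTableStore {
  private final Map<String, List<String>> tables = new HashMap<>();

  void createTable(String name, List<String> rows) {
    tables.put(name, new ArrayList<>(rows));
  }

  void deleteTable(String name) {
    tables.remove(name);
  }

  boolean exists(String name) {
    return tables.containsKey(name);
  }

  long countDistinct(String name) {
    return tables.get(name).stream().distinct().count();
  }
}
```

   Because the test owns both creation and deletion, it no longer depends on a
   table that someone populated by hand in a particular project.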
 

[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-08-21 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=298996&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-298996
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 21/Aug/19 20:40
Start Date: 21/Aug/19 20:40
Worklog Time Spent: 10m 
  Work Description: boyuanzz commented on pull request #9144: [BEAM-7013] 
Integrating ZetaSketch's HLL++ algorithm with Beam
URL: https://github.com/apache/beam/pull/9144#discussion_r316392094
 
 

 ##
 File path: sdks/java/extensions/zetasketch/build.gradle
 ##
 @@ -0,0 +1,64 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * License); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+import groovy.json.JsonOutput
+
+plugins { id 'org.apache.beam.module' }
+applyJavaNature()
+
+description = "Apache Beam :: SDKs :: Java :: Extensions :: ZetaSketch"
+
+def zetasketch_version = "0.1.0"
+
+dependencies {
+compile library.java.vendored_guava_26_0_jre
+compile project(path: ":sdks:java:core", configuration: "shadow")
+compile "com.google.zetasketch:zetasketch:$zetasketch_version"
+testCompile library.java.junit
+testCompile project(":sdks:java:io:google-cloud-platform")
+testRuntimeOnly project(":runners:direct-java")
+testRuntimeOnly project(":runners:google-cloud-dataflow-java")
+}
+
+/**
+ * Integration tests running on Dataflow with BigQuery.
+ */
+task integrationTest(type: Test) {
+group = "Verification"
+def gcpProject = project.findProperty('gcpProject') ?: 'apache-beam-testing'
+def gcpTempRoot = project.findProperty('gcpTempRoot') ?: 'gs://temp-storage-for-end-to-end-tests'
+systemProperty "beamTestPipelineOptions", JsonOutput.toJson([
+"--runner=TestDataflowRunner",
+"--project=${gcpProject}",
+"--tempRoot=${gcpTempRoot}",
+])
 
 Review comment:
   But you want to run the test with DataflowRunner, right? If you want it to 
always run against the worker built from head, you need to add two more 
command-line args, as in:
   
https://github.com/apache/beam/blob/master/runners/google-cloud-dataflow-java/build.gradle#L161.
 Otherwise, the prebuilt worker image is used.
 



Issue Time Tracking
---

Worklog Id: (was: 298996)
Time Spent: 25h 20m  (was: 25h 10m)

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
>  Issue Type: New Feature
>  Components: extensions-java-sketching, sdk-java-core
>Reporter: Yueyang Qiu
>Assignee: Yueyang Qiu
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 25h 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-08-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=298287&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-298287
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 20/Aug/19 22:51
Start Date: 20/Aug/19 22:51
Worklog Time Spent: 10m 
  Work Description: robinyqiu commented on issue #9144: [BEAM-7013] 
Integrating ZetaSketch's HLL++ algorithm with Beam
URL: https://github.com/apache/beam/pull/9144#issuecomment-523225422
 
 
   Hi, @boyuanzz and @zfraa. Thanks again for your review! I have answered your 
questions and made necessary changes to the code. PTAL.
 



Issue Time Tracking
---

Worklog Id: (was: 298287)
Time Spent: 25h 10m  (was: 25h)

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
>  Issue Type: New Feature
>  Components: extensions-java-sketching, sdk-java-core
>Reporter: Yueyang Qiu
>Assignee: Yueyang Qiu
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 25h 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-08-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=298280&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-298280
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 20/Aug/19 22:48
Start Date: 20/Aug/19 22:48
Worklog Time Spent: 10m 
  Work Description: robinyqiu commented on pull request #9144: [BEAM-7013] 
Integrating ZetaSketch's HLL++ algorithm with Beam
URL: https://github.com/apache/beam/pull/9144#discussion_r315901106
 
 

 ##
 File path: sdks/java/extensions/zetasketch/build.gradle
 ##
 @@ -0,0 +1,64 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * License); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+import groovy.json.JsonOutput
+
 
 Review comment:
   Interesting. I looked into the usage of `evaluationDependsOn`. From the 
comments in the link you shared:
   
https://github.com/apache/beam/blob/ff7964c7252c8a0c670f69bb4291230ca6136afd/runners/direct-java/build.gradle#L44-L48
   
   it says it is only needed because in that script `sourceSets.test.output` of 
another project is directly referenced
   
https://github.com/apache/beam/blob/ff7964c7252c8a0c670f69bb4291230ca6136afd/runners/direct-java/build.gradle#L102-L104
   
   I also searched the codebase, and that seems to be the only reason people 
add `evaluationDependsOn` to their build scripts.
   
   However, directly referencing `sourceSets.test.output` is discouraged. Quote 
from https://discuss.gradle.org/t/evaluationdependson-annoyance/21783/2:
   >While you can reference tasks from other projects, it really should be a 
last resort. Using evaluationDependsOn is a bit of a smell. Dependencies, 
configurations, and artifacts are a better model than directly accessing 
outputs of a task from another project.
   
   So I would prefer not to add it to this script, because we do not reference 
another project's outputs directly here.
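   
   For illustration, a minimal sketch of the "dependencies, configurations, and 
artifacts" model from the quote above. The project path `:producer`, the 
configuration name `testArtifacts`, and the task name `testJar` are 
hypothetical and not taken from the Beam build; the sketch also assumes the 
`java` plugin is applied in both projects:
   
   ```groovy
   // build.gradle of the hypothetical producer project (:producer)
   configurations {
     testArtifacts  // hypothetical configuration that exposes the test classes
   }

   task testJar(type: Jar) {
     archiveClassifier = 'tests'   // on Gradle versions before 5.1, use `classifier`
     from sourceSets.test.output   // referenced only inside the owning project
   }

   artifacts {
     testArtifacts testJar
   }

   // build.gradle of the consuming project
   dependencies {
     testImplementation project(path: ':producer', configuration: 'testArtifacts')
   }
   ```
   
   Because the consumer depends on a declared configuration rather than on 
another project's task outputs, it should not need `evaluationDependsOn`.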
 



Issue Time Tracking
---

Worklog Id: (was: 298280)
Time Spent: 24h 50m  (was: 24h 40m)

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
>  Issue Type: New Feature
>  Components: extensions-java-sketching, sdk-java-core
>Reporter: Yueyang Qiu
>Assignee: Yueyang Qiu
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 24h 50m
>  Remaining Estimate: 0h
>






[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-08-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=298281&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-298281
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 20/Aug/19 22:48
Start Date: 20/Aug/19 22:48
Worklog Time Spent: 10m 
  Work Description: robinyqiu commented on pull request #9144: [BEAM-7013] 
Integrating ZetaSketch's HLL++ algorithm with Beam
URL: https://github.com/apache/beam/pull/9144#discussion_r315901106
 
 

 ##
 File path: sdks/java/extensions/zetasketch/build.gradle
 ##
 @@ -0,0 +1,64 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * License); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+import groovy.json.JsonOutput
+
 
 Review comment:
   Interesting. I looked into the usage of `evaluationDependsOn`. From the 
comments in the link you shared:
   
https://github.com/apache/beam/blob/ff7964c7252c8a0c670f69bb4291230ca6136afd/runners/direct-java/build.gradle#L46-L48
   
   it says it is only needed because in that script `sourceSets.test.output` of 
another project is directly referenced
   
https://github.com/apache/beam/blob/ff7964c7252c8a0c670f69bb4291230ca6136afd/runners/direct-java/build.gradle#L102-L104
   
   I also searched the codebase, and that seems to be the only reason people 
add `evaluationDependsOn` to their build scripts.
   
   However, directly referencing `sourceSets.test.output` is discouraged. Quote 
from https://discuss.gradle.org/t/evaluationdependson-annoyance/21783/2:
   >While you can reference tasks from other projects, it really should be a 
last resort. Using evaluationDependsOn is a bit of a smell. Dependencies, 
configurations, and artifacts are a better model than directly accessing 
outputs of a task from another project.
   
   So I would prefer not to add it to this script, because we do not reference 
another project's outputs directly here.
 



Issue Time Tracking
---

Worklog Id: (was: 298281)
Time Spent: 25h  (was: 24h 50m)

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
>  Issue Type: New Feature
>  Components: extensions-java-sketching, sdk-java-core
>Reporter: Yueyang Qiu
>Assignee: Yueyang Qiu
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 25h
>  Remaining Estimate: 0h
>






[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-08-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=298279&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-298279
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 20/Aug/19 22:47
Start Date: 20/Aug/19 22:47
Worklog Time Spent: 10m 
  Work Description: robinyqiu commented on pull request #9144: [BEAM-7013] 
Integrating ZetaSketch's HLL++ algorithm with Beam
URL: https://github.com/apache/beam/pull/9144#discussion_r315901106
 
 

 ##
 File path: sdks/java/extensions/zetasketch/build.gradle
 ##
 @@ -0,0 +1,64 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * License); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+import groovy.json.JsonOutput
+
 
 Review comment:
   Interesting. I looked into the usage of `evaluationDependsOn`. From the 
comments in the link you shared:
   
https://github.com/apache/beam/blob/ff7964c7252c8a0c670f69bb4291230ca6136afd/runners/direct-java/build.gradle#L50
   
   it says it is only needed because in that script `sourceSets.test.output` of 
another project is directly referenced
   
https://github.com/apache/beam/blob/ff7964c7252c8a0c670f69bb4291230ca6136afd/runners/direct-java/build.gradle#L102-L104
   
   I also searched the codebase, and that seems to be the only reason people 
add `evaluationDependsOn` to their build scripts.
   
   However, directly referencing `sourceSets.test.output` is discouraged. Quote 
from https://discuss.gradle.org/t/evaluationdependson-annoyance/21783/2:
   >While you can reference tasks from other projects, it really should be a 
last resort. Using evaluationDependsOn is a bit of a smell. Dependencies, 
configurations, and artifacts are a better model than directly accessing 
outputs of a task from another project.
   
   So I would prefer not to add it to this script, because we do not reference 
another project's outputs directly here.
 



Issue Time Tracking
---

Worklog Id: (was: 298279)
Time Spent: 24h 40m  (was: 24.5h)

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
>  Issue Type: New Feature
>  Components: extensions-java-sketching, sdk-java-core
>Reporter: Yueyang Qiu
>Assignee: Yueyang Qiu
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 24h 40m
>  Remaining Estimate: 0h
>






[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-08-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=298277&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-298277
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 20/Aug/19 22:46
Start Date: 20/Aug/19 22:46
Worklog Time Spent: 10m 
  Work Description: robinyqiu commented on pull request #9144: [BEAM-7013] 
Integrating ZetaSketch's HLL++ algorithm with Beam
URL: https://github.com/apache/beam/pull/9144#discussion_r315901106
 
 

 ##
 File path: sdks/java/extensions/zetasketch/build.gradle
 ##
 @@ -0,0 +1,64 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * License); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+import groovy.json.JsonOutput
+
 
 Review comment:
   Interesting. I looked into the usage of `evaluationDependsOn`. From the 
comments in the link you shared:
   
https://github.com/apache/beam/blob/master/runners/direct-java/build.gradle#L46-L48
   
   it says it is only needed because in that script `sourceSets.test.output` of 
another project is directly referenced
   
https://github.com/apache/beam/blob/master/runners/direct-java/build.gradle#L102-L104
   
   I also searched the codebase, and that seems to be the only reason people 
add `evaluationDependsOn` to their build scripts.
   
   However, directly referencing `sourceSets.test.output` is discouraged. Quote 
from https://discuss.gradle.org/t/evaluationdependson-annoyance/21783/2:
   >While you can reference tasks from other projects, it really should be a 
last resort. Using evaluationDependsOn is a bit of a smell. Dependencies, 
configurations, and artifacts are a better model than directly accessing 
outputs of a task from another project.
   
   So I would prefer not to add it to this script, because we do not reference 
another project's outputs directly here.
 



Issue Time Tracking
---

Worklog Id: (was: 298277)
Time Spent: 24.5h  (was: 24h 20m)

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
>  Issue Type: New Feature
>  Components: extensions-java-sketching, sdk-java-core
>Reporter: Yueyang Qiu
>Assignee: Yueyang Qiu
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 24.5h
>  Remaining Estimate: 0h
>






[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-08-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=298276&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-298276
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 20/Aug/19 22:45
Start Date: 20/Aug/19 22:45
Worklog Time Spent: 10m 
  Work Description: robinyqiu commented on pull request #9144: [BEAM-7013] 
Integrating ZetaSketch's HLL++ algorithm with Beam
URL: https://github.com/apache/beam/pull/9144#discussion_r315901106
 
 

 ##
 File path: sdks/java/extensions/zetasketch/build.gradle
 ##
 @@ -0,0 +1,64 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * License); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+import groovy.json.JsonOutput
+
 
 Review comment:
   Interesting. I looked into the usage of `evaluationDependsOn`. From the 
comments in the link you shared
   
   
https://github.com/apache/beam/blob/master/runners/direct-java/build.gradle#L46-L48
   
   it says it is only needed because in that script `sourceSets.test.output` of 
another project is directly referenced
   
https://github.com/apache/beam/blob/master/runners/direct-java/build.gradle#L102-L104
   
   I also searched the codebase, and that seems to be the only reason people 
add `evaluationDependsOn` to their build scripts.
   
   However, directly referencing `sourceSets.test.output` is discouraged. Quote 
from https://discuss.gradle.org/t/evaluationdependson-annoyance/21783/2:
   >While you can reference tasks from other projects, it really should be a 
last resort. Using evaluationDependsOn is a bit of a smell. Dependencies, 
configurations, and artifacts are a better model than directly accessing 
outputs of a task from another project.
   
   So I would prefer not to add it to this script, because we do not reference 
another project's outputs directly here.
 



Issue Time Tracking
---

Worklog Id: (was: 298276)
Time Spent: 24h 20m  (was: 24h 10m)

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
>  Issue Type: New Feature
>  Components: extensions-java-sketching, sdk-java-core
>Reporter: Yueyang Qiu
>Assignee: Yueyang Qiu
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 24h 20m
>  Remaining Estimate: 0h
>






[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-08-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=298275&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-298275
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 20/Aug/19 22:45
Start Date: 20/Aug/19 22:45
Worklog Time Spent: 10m 
  Work Description: robinyqiu commented on pull request #9144: [BEAM-7013] 
Integrating ZetaSketch's HLL++ algorithm with Beam
URL: https://github.com/apache/beam/pull/9144#discussion_r315901106
 
 

 ##
 File path: sdks/java/extensions/zetasketch/build.gradle
 ##
 @@ -0,0 +1,64 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * License); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+import groovy.json.JsonOutput
+
 
 Review comment:
   Interesting. I looked into the usage of `evaluationDependsOn`. From the 
comments in the link you shared
   
https://github.com/apache/beam/blob/master/runners/direct-java/build.gradle#L46-L48
   
   it says it is only needed because in that script `sourceSets.test.output` of 
another project is directly referenced
   
https://github.com/apache/beam/blob/master/runners/direct-java/build.gradle#L102-L104
   
   I also searched the codebase, and that seems to be the only reason people 
add `evaluationDependsOn` to their build scripts.
   
   However, directly referencing `sourceSets.test.output` is discouraged. Quote 
from https://discuss.gradle.org/t/evaluationdependson-annoyance/21783/2:
   >While you can reference tasks from other projects, it really should be a 
last resort. Using evaluationDependsOn is a bit of a smell. Dependencies, 
configurations, and artifacts are a better model than directly accessing 
outputs of a task from another project.
   
   So I would prefer not to add it to this script, because we do not reference 
another project's outputs directly here.
 



Issue Time Tracking
---

Worklog Id: (was: 298275)
Time Spent: 24h 10m  (was: 24h)

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
>  Issue Type: New Feature
>  Components: extensions-java-sketching, sdk-java-core
>Reporter: Yueyang Qiu
>Assignee: Yueyang Qiu
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 24h 10m
>  Remaining Estimate: 0h
>






[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-08-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=298186&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-298186
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 20/Aug/19 21:00
Start Date: 20/Aug/19 21:00
Worklog Time Spent: 10m 
  Work Description: robinyqiu commented on pull request #9144: [BEAM-7013] 
Integrating ZetaSketch's HLL++ algorithm with Beam
URL: https://github.com/apache/beam/pull/9144#discussion_r315901106
 
 

 ##
 File path: sdks/java/extensions/zetasketch/build.gradle
 ##
 @@ -0,0 +1,64 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * License); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+import groovy.json.JsonOutput
+
 
 Review comment:
   Interesting. I looked into the usage of `evaluationDependsOn`. From the 
comments in the link you shared
   
https://github.com/apache/beam/blob/master/runners/direct-java/build.gradle#L46-L48
   it says it is only needed because in that script `sourceSets.test.output` of 
another project is directly referenced
   
https://github.com/apache/beam/blob/master/runners/direct-java/build.gradle#L102-L104
   I also searched the codebase, and that seems to be the only reason people 
add `evaluationDependsOn` to their build scripts.
   
   However, directly referencing `sourceSets.test.output` is discouraged. Quote 
from https://discuss.gradle.org/t/evaluationdependson-annoyance/21783/2:
   >While you can reference tasks from other projects, it really should be a 
last resort. Using evaluationDependsOn is a bit of a smell. Dependencies, 
configurations, and artifacts are a better model than directly accessing 
outputs of a task from another project.
   
   So I would prefer not to add it to this script, because we do not reference 
another project's outputs directly here.
 



Issue Time Tracking
---

Worklog Id: (was: 298186)
Time Spent: 24h  (was: 23h 50m)

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
>  Issue Type: New Feature
>  Components: extensions-java-sketching, sdk-java-core
>Reporter: Yueyang Qiu
>Assignee: Yueyang Qiu
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 24h
>  Remaining Estimate: 0h
>






[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-08-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=298178&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-298178
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 20/Aug/19 20:51
Start Date: 20/Aug/19 20:51
Worklog Time Spent: 10m 
  Work Description: robinyqiu commented on pull request #9144: [BEAM-7013] 
Integrating ZetaSketch's HLL++ algorithm with Beam
URL: https://github.com/apache/beam/pull/9144#discussion_r315874860
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/HllCountTest.java
 ##
 @@ -0,0 +1,373 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.beam.sdk.extensions.zetasketch;
+
+import com.google.zetasketch.HyperLogLogPlusPlus;
+import com.google.zetasketch.shaded.com.google.protobuf.ByteString;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+import org.apache.beam.sdk.Pipeline.PipelineExecutionException;
+import org.apache.beam.sdk.testing.NeedsRunner;
+import org.apache.beam.sdk.testing.PAssert;
+import org.apache.beam.sdk.testing.TestPipeline;
+import org.apache.beam.sdk.transforms.Create;
+import org.apache.beam.sdk.values.KV;
+import org.apache.beam.sdk.values.PCollection;
+import org.apache.beam.sdk.values.TypeDescriptor;
+import org.junit.Rule;
+import org.junit.Test;
+import org.junit.experimental.categories.Category;
+import org.junit.rules.ExpectedException;
+import org.junit.runner.RunWith;
+import org.junit.runners.JUnit4;
+
+/** Tests for {@link HllCount}. */
+@RunWith(JUnit4.class)
+public class HllCountTest {
+
+  @Rule public final transient TestPipeline p = TestPipeline.create();
+  @Rule public transient ExpectedException thrown = ExpectedException.none();
+
+  // Integer
+  private static final List<Integer> INTS1 = Arrays.asList(1, 2, 3, 3, 1, 4);
+  private static final byte[] INTS1_SKETCH;
+  private static final Long INTS1_ESTIMATE;
+
+  static {
+    HyperLogLogPlusPlus<Integer> hll = new HyperLogLogPlusPlus.Builder().buildForIntegers();
+    INTS1.forEach(hll::add);
+    INTS1_SKETCH = hll.serializeToByteArray();
+    INTS1_ESTIMATE = hll.longResult();
+  }
+
+  private static final List<Integer> INTS2 = Arrays.asList(3, 3, 3, 3);
+  private static final byte[] INTS2_SKETCH;
+  private static final Long INTS2_ESTIMATE;
+
+  static {
+    HyperLogLogPlusPlus<Integer> hll = new HyperLogLogPlusPlus.Builder().buildForIntegers();
+    INTS2.forEach(hll::add);
+    INTS2_SKETCH = hll.serializeToByteArray();
+    INTS2_ESTIMATE = hll.longResult();
+  }
+
+  private static final byte[] INTS1_INTS2_SKETCH;
+
+  static {
+    HyperLogLogPlusPlus<?> hll = HyperLogLogPlusPlus.forProto(INTS1_SKETCH);
+    hll.merge(INTS2_SKETCH);
+    INTS1_INTS2_SKETCH = hll.serializeToByteArray();
+  }
+
+  // Long
+  private static final List<Long> LONGS = Collections.singletonList(1L);
+  private static final byte[] LONGS_SKETCH;
+
+  static {
+    HyperLogLogPlusPlus<Long> hll = new HyperLogLogPlusPlus.Builder().buildForLongs();
+    LONGS.forEach(hll::add);
+    LONGS_SKETCH = hll.serializeToByteArray();
+  }
+
+  private static final byte[] LONGS_EMPTY_SKETCH;
+
+  static {
+    HyperLogLogPlusPlus<Long> hll = new HyperLogLogPlusPlus.Builder().buildForLongs();
+    LONGS_EMPTY_SKETCH = hll.serializeToByteArray();
+  }
+
+  // String
+  private static final List<String> STRINGS = Arrays.asList("s1", "s2", "s1", "s2");
+  private static final byte[] STRINGS_SKETCH;
+
+  static {
+    HyperLogLogPlusPlus<String> hll = new HyperLogLogPlusPlus.Builder().buildForStrings();
+    STRINGS.forEach(hll::add);
+    STRINGS_SKETCH = hll.serializeToByteArray();
+  }
+
+  private static final int TEST_PRECISION = 20;
+  private static final byte[] STRINGS_SKETCH_TEST_PRECISION;
+
+  static {
+    HyperLogLogPlusPlus<String> hll =
+        new HyperLogLogPlusPlus.Builder().normalPrecision(TEST_PRECISION).buildForStrings();
+    STRINGS.forEach(hll::add);
+    STRINGS_SKETCH_TEST_PRECISION = hll.serializeToByteArray();
+  }
+
+  // Bytes
+  private static final byte[] BYTES0 = {(byte) 0x1, (byte) 

[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-08-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=298177&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-298177
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 20/Aug/19 20:50
Start Date: 20/Aug/19 20:50
Worklog Time Spent: 10m 
  Work Description: robinyqiu commented on pull request #9144: [BEAM-7013] 
Integrating ZetaSketch's HLL++ algorithm with Beam
URL: https://github.com/apache/beam/pull/9144#discussion_r315874860
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/HllCountTest.java
 ##
 @@ -0,0 +1,373 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.beam.sdk.extensions.zetasketch;
+
+import com.google.zetasketch.HyperLogLogPlusPlus;
+import com.google.zetasketch.shaded.com.google.protobuf.ByteString;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+import org.apache.beam.sdk.Pipeline.PipelineExecutionException;
+import org.apache.beam.sdk.testing.NeedsRunner;
+import org.apache.beam.sdk.testing.PAssert;
+import org.apache.beam.sdk.testing.TestPipeline;
+import org.apache.beam.sdk.transforms.Create;
+import org.apache.beam.sdk.values.KV;
+import org.apache.beam.sdk.values.PCollection;
+import org.apache.beam.sdk.values.TypeDescriptor;
+import org.junit.Rule;
+import org.junit.Test;
+import org.junit.experimental.categories.Category;
+import org.junit.rules.ExpectedException;
+import org.junit.runner.RunWith;
+import org.junit.runners.JUnit4;
+
+/** Tests for {@link HllCount}. */
+@RunWith(JUnit4.class)
+public class HllCountTest {
+
+  @Rule public final transient TestPipeline p = TestPipeline.create();
+  @Rule public transient ExpectedException thrown = ExpectedException.none();
+
+  // Integer
+  private static final List<Integer> INTS1 = Arrays.asList(1, 2, 3, 3, 1, 4);
+  private static final byte[] INTS1_SKETCH;
+  private static final Long INTS1_ESTIMATE;
+
+  static {
+    HyperLogLogPlusPlus<Integer> hll = new HyperLogLogPlusPlus.Builder().buildForIntegers();
+    INTS1.forEach(hll::add);
+    INTS1_SKETCH = hll.serializeToByteArray();
+    INTS1_ESTIMATE = hll.longResult();
+  }
+
+  private static final List<Integer> INTS2 = Arrays.asList(3, 3, 3, 3);
+  private static final byte[] INTS2_SKETCH;
+  private static final Long INTS2_ESTIMATE;
+
+  static {
+    HyperLogLogPlusPlus<Integer> hll = new HyperLogLogPlusPlus.Builder().buildForIntegers();
+    INTS2.forEach(hll::add);
+    INTS2_SKETCH = hll.serializeToByteArray();
+    INTS2_ESTIMATE = hll.longResult();
+  }
+
+  private static final byte[] INTS1_INTS2_SKETCH;
+
+  static {
+    HyperLogLogPlusPlus<?> hll = HyperLogLogPlusPlus.forProto(INTS1_SKETCH);
+    hll.merge(INTS2_SKETCH);
+    INTS1_INTS2_SKETCH = hll.serializeToByteArray();
+  }
+
+  // Long
+  private static final List<Long> LONGS = Collections.singletonList(1L);
+  private static final byte[] LONGS_SKETCH;
+
+  static {
+    HyperLogLogPlusPlus<Long> hll = new HyperLogLogPlusPlus.Builder().buildForLongs();
+    LONGS.forEach(hll::add);
+    LONGS_SKETCH = hll.serializeToByteArray();
+  }
+
+  private static final byte[] LONGS_EMPTY_SKETCH;
+
+  static {
+    HyperLogLogPlusPlus<Long> hll = new HyperLogLogPlusPlus.Builder().buildForLongs();
+    LONGS_EMPTY_SKETCH = hll.serializeToByteArray();
+  }
+
+  // String
+  private static final List<String> STRINGS = Arrays.asList("s1", "s2", "s1", "s2");
+  private static final byte[] STRINGS_SKETCH;
+
+  static {
+    HyperLogLogPlusPlus<String> hll = new HyperLogLogPlusPlus.Builder().buildForStrings();
+    STRINGS.forEach(hll::add);
+    STRINGS_SKETCH = hll.serializeToByteArray();
+  }
+
+  private static final int TEST_PRECISION = 20;
+  private static final byte[] STRINGS_SKETCH_TEST_PRECISION;
+
+  static {
+    HyperLogLogPlusPlus<String> hll =
+        new HyperLogLogPlusPlus.Builder().normalPrecision(TEST_PRECISION).buildForStrings();
+    STRINGS.forEach(hll::add);
+    STRINGS_SKETCH_TEST_PRECISION = hll.serializeToByteArray();
+  }
+
+  // Bytes
+  private static final byte[] BYTES0 = {(byte) 0x1, (byte) 

[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-08-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=298174&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-298174
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 20/Aug/19 20:49
Start Date: 20/Aug/19 20:49
Worklog Time Spent: 10m 
  Work Description: robinyqiu commented on pull request #9144: [BEAM-7013] 
Integrating ZetaSketch's HLL++ algorithm with Beam
URL: https://github.com/apache/beam/pull/9144#discussion_r315874860
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/HllCountTest.java
 ##
 @@ -0,0 +1,373 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.beam.sdk.extensions.zetasketch;
+
+import com.google.zetasketch.HyperLogLogPlusPlus;
+import com.google.zetasketch.shaded.com.google.protobuf.ByteString;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+import org.apache.beam.sdk.Pipeline.PipelineExecutionException;
+import org.apache.beam.sdk.testing.NeedsRunner;
+import org.apache.beam.sdk.testing.PAssert;
+import org.apache.beam.sdk.testing.TestPipeline;
+import org.apache.beam.sdk.transforms.Create;
+import org.apache.beam.sdk.values.KV;
+import org.apache.beam.sdk.values.PCollection;
+import org.apache.beam.sdk.values.TypeDescriptor;
+import org.junit.Rule;
+import org.junit.Test;
+import org.junit.experimental.categories.Category;
+import org.junit.rules.ExpectedException;
+import org.junit.runner.RunWith;
+import org.junit.runners.JUnit4;
+
+/** Tests for {@link HllCount}. */
+@RunWith(JUnit4.class)
+public class HllCountTest {
+
+  @Rule public final transient TestPipeline p = TestPipeline.create();
+  @Rule public transient ExpectedException thrown = ExpectedException.none();
+
+  // Integer
+  private static final List<Integer> INTS1 = Arrays.asList(1, 2, 3, 3, 1, 4);
+  private static final byte[] INTS1_SKETCH;
+  private static final Long INTS1_ESTIMATE;
+
+  static {
+    HyperLogLogPlusPlus<Integer> hll = new HyperLogLogPlusPlus.Builder().buildForIntegers();
+    INTS1.forEach(hll::add);
+    INTS1_SKETCH = hll.serializeToByteArray();
+    INTS1_ESTIMATE = hll.longResult();
+  }
+
+  private static final List<Integer> INTS2 = Arrays.asList(3, 3, 3, 3);
+  private static final byte[] INTS2_SKETCH;
+  private static final Long INTS2_ESTIMATE;
+
+  static {
+    HyperLogLogPlusPlus<Integer> hll = new HyperLogLogPlusPlus.Builder().buildForIntegers();
+    INTS2.forEach(hll::add);
+    INTS2_SKETCH = hll.serializeToByteArray();
+    INTS2_ESTIMATE = hll.longResult();
+  }
+
+  private static final byte[] INTS1_INTS2_SKETCH;
+
+  static {
+    HyperLogLogPlusPlus<?> hll = HyperLogLogPlusPlus.forProto(INTS1_SKETCH);
+    hll.merge(INTS2_SKETCH);
+    INTS1_INTS2_SKETCH = hll.serializeToByteArray();
+  }
+
+  // Long
+  private static final List<Long> LONGS = Collections.singletonList(1L);
+  private static final byte[] LONGS_SKETCH;
+
+  static {
+    HyperLogLogPlusPlus<Long> hll = new HyperLogLogPlusPlus.Builder().buildForLongs();
+    LONGS.forEach(hll::add);
+    LONGS_SKETCH = hll.serializeToByteArray();
+  }
+
+  private static final byte[] LONGS_EMPTY_SKETCH;
+
+  static {
+    HyperLogLogPlusPlus<Long> hll = new HyperLogLogPlusPlus.Builder().buildForLongs();
+    LONGS_EMPTY_SKETCH = hll.serializeToByteArray();
+  }
+
+  // String
+  private static final List<String> STRINGS = Arrays.asList("s1", "s2", "s1", "s2");
+  private static final byte[] STRINGS_SKETCH;
+
+  static {
+    HyperLogLogPlusPlus<String> hll = new HyperLogLogPlusPlus.Builder().buildForStrings();
+    STRINGS.forEach(hll::add);
+    STRINGS_SKETCH = hll.serializeToByteArray();
+  }
+
+  private static final int TEST_PRECISION = 20;
+  private static final byte[] STRINGS_SKETCH_TEST_PRECISION;
+
+  static {
+    HyperLogLogPlusPlus<String> hll =
+        new HyperLogLogPlusPlus.Builder().normalPrecision(TEST_PRECISION).buildForStrings();
+    STRINGS.forEach(hll::add);
+    STRINGS_SKETCH_TEST_PRECISION = hll.serializeToByteArray();
+  }
+
+  // Bytes
+  private static final byte[] BYTES0 = {(byte) 0x1, (byte) 

[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-08-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=298158&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-298158
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 20/Aug/19 20:40
Start Date: 20/Aug/19 20:40
Worklog Time Spent: 10m 
  Work Description: robinyqiu commented on pull request #9144: [BEAM-7013] 
Integrating ZetaSketch's HLL++ algorithm with Beam
URL: https://github.com/apache/beam/pull/9144#discussion_r315892837
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/HllCountTest.java
 ##
 @@ -0,0 +1,373 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.beam.sdk.extensions.zetasketch;
+
+import com.google.zetasketch.HyperLogLogPlusPlus;
+import com.google.zetasketch.shaded.com.google.protobuf.ByteString;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+import org.apache.beam.sdk.Pipeline.PipelineExecutionException;
+import org.apache.beam.sdk.testing.NeedsRunner;
+import org.apache.beam.sdk.testing.PAssert;
+import org.apache.beam.sdk.testing.TestPipeline;
+import org.apache.beam.sdk.transforms.Create;
+import org.apache.beam.sdk.values.KV;
+import org.apache.beam.sdk.values.PCollection;
+import org.apache.beam.sdk.values.TypeDescriptor;
+import org.junit.Rule;
+import org.junit.Test;
+import org.junit.experimental.categories.Category;
+import org.junit.rules.ExpectedException;
+import org.junit.runner.RunWith;
+import org.junit.runners.JUnit4;
+
+/** Tests for {@link HllCount}. */
+@RunWith(JUnit4.class)
 
 Review comment:
   It is included.
   
   Locally, you can run `./gradlew :sdks:java:extensions:zetasketch:test` to execute the tests.
   
   On Jenkins, it runs as part of the Java PreCommit tests: 
https://builds.apache.org/job/beam_PreCommit_Java_Commit/7374/testReport/org.apache.beam.sdk.extensions.zetasketch/
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 298158)
Time Spent: 23h 20m  (was: 23h 10m)

> A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation
>
>                 Key: BEAM-7013
>                 URL: https://issues.apache.org/jira/browse/BEAM-7013
>             Project: Beam
>          Issue Type: New Feature
>          Components: extensions-java-sketching, sdk-java-core
>            Reporter: Yueyang Qiu
>            Assignee: Yueyang Qiu
>            Priority: Major
>             Fix For: 2.16.0
>
>          Time Spent: 23h 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)

