[jira] [Work logged] (BEAM-7866) Python MongoDB IO performance and correctness issues

2019-08-04 Thread ASF GitHub Bot (JIRA)


 [ https://issues.apache.org/jira/browse/BEAM-7866?focusedWorklogId=288683&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288683 ]

ASF GitHub Bot logged work on BEAM-7866:


Author: ASF GitHub Bot
Created on: 05/Aug/19 05:37
Start Date: 05/Aug/19 05:37
Worklog Time Spent: 10m 
  Work Description: y1chi commented on pull request #9233:  [BEAM-7866] Fix 
python ReadFromMongoDB potential data loss issue
URL: https://github.com/apache/beam/pull/9233#discussion_r310443556
 
 

 ##
 File path: sdks/python/apache_beam/io/mongodbio_test.py
 ##
 @@ -30,38 +32,102 @@
 from apache_beam.io.mongodbio import _BoundedMongoSource
 from apache_beam.io.mongodbio import _GenerateObjectIdFn
 from apache_beam.io.mongodbio import _MongoSink
+from apache_beam.io.mongodbio import _ObjectIdRangeTracker
 from apache_beam.io.mongodbio import _WriteMongoFn
 from apache_beam.testing.test_pipeline import TestPipeline
 from apache_beam.testing.util import assert_that
 from apache_beam.testing.util import equal_to
 
 
+class _MockMongoColl(object):
+  def __init__(self, docs):
+    self.docs = docs
+
+  def _filter(self, filter):
+    match = []
+    if not filter:
+      return self
+    start = filter['_id'].get('$gte')
+    end = filter['_id'].get('$lt')
+    for doc in self.docs:
+      if start and doc['_id'] < start:
+        continue
+      if end and doc['_id'] >= end:
+        continue
+      match.append(doc)
+    return match
+
+  def find(self, filter=None, **kwargs):
+    return _MockMongoColl(self._filter(filter))
+
+  def sort(self, sort_items):
+    key, order = sort_items[0]
+    self.docs = sorted(self.docs,
+                       key=lambda x: x[key],
+                       reverse=(order != ASCENDING))
+    return self
+
+  def limit(self, num):
+    return _MockMongoColl(self.docs[0:num])
+
+  def count_documents(self, filter):
+    return len(self._filter(filter))
+
+  def __getitem__(self, item):
+    return self.docs[item]
+
 class MongoSourceTest(unittest.TestCase):
-  @mock.patch('apache_beam.io.mongodbio._BoundedMongoSource'
-              '._get_document_count')
-  @mock.patch('apache_beam.io.mongodbio._BoundedMongoSource'
-              '._get_avg_document_size')
-  def setUp(self, mock_size, mock_count):
-    mock_size.return_value = 10
-    mock_count.return_value = 5
+  @mock.patch('apache_beam.io.mongodbio.MongoClient')
+  def setUp(self, mock_client):
+    mock_client.return_value.__enter__.return_value.__getitem__ \
+      .return_value.command.return_value = {'size': 5, 'avgSize': 1}
+    self._ids = [
+        objectid.ObjectId.from_datetime(
+            datetime.datetime(year=2020, month=i + 1, day=i + 1))
+        for i in range(5)
+    ]
+    self._docs = [{'_id': self._ids[i], 'x': i} for i in range(len(self._ids))]
+
     self.mongo_source = _BoundedMongoSource('mongodb://test', 'testdb',
                                             'testcoll')
 
-  def test_estimate_size(self):
-    self.assertEqual(self.mongo_source.estimate_size(), 50)
+  def get_split(self, command, ns, min, max, maxChunkSize, **kwargs):
 
 Review comment:
  maxChunkSize is the actual argument name the mongo client expects (maybe by 
mistake), so I'm not able to change it to snake case.
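A minimal sketch (hypothetical names, not the PR's exact test code) of why the mock handler has to keep MongoDB's camelCase name: the source passes `maxChunkSize=...` as a keyword argument to `command('splitVector', ...)`, so a snake_case parameter on the mock side would raise a TypeError:

```python
from unittest import mock

def fake_split(command, ns, min=None, max=None, maxChunkSize=None, **kwargs):
  # Response shaped like MongoDB's splitVector output; values are illustrative.
  return {'splitKeys': [{'_id': 30}, {'_id': 60}]}

db = mock.MagicMock()
db.command.side_effect = fake_split
resp = db.command('splitVector', 'testdb.testcoll', keyPattern={'_id': 1},
                  min={'_id': 0}, max={'_id': 100}, maxChunkSize=1)
assert [k['_id'] for k in resp['splitKeys']] == [30, 60]
```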
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 288683)
Time Spent: 2h 50m  (was: 2h 40m)

> Python MongoDB IO performance and correctness issues
> 
>
> Key: BEAM-7866
> URL: https://issues.apache.org/jira/browse/BEAM-7866
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Eugene Kirpichov
>Assignee: Yichi Zhang
>Priority: Blocker
> Fix For: 2.15.0
>
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/mongodbio.py
>  splits the query result by computing number of results in constructor, and 
> then in each reader re-executing the whole query and getting an index 
> sub-range of those results.
> This is broken in several critical ways:
> - The order of query results returned by find() is not necessarily 
> deterministic, so the idea of index ranges on it is meaningless: each shard 
> may basically get random, possibly overlapping subsets of the total results
> - Even if you add order by `_id`, the database may be changing concurrently 
> with reading and splitting. E.g. if the database contained documents with ids 
> 

[jira] [Work logged] (BEAM-7866) Python MongoDB IO performance and correctness issues

2019-08-04 Thread ASF GitHub Bot (JIRA)


 [ https://issues.apache.org/jira/browse/BEAM-7866?focusedWorklogId=288684&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288684 ]

ASF GitHub Bot logged work on BEAM-7866:


Author: ASF GitHub Bot
Created on: 05/Aug/19 05:37
Start Date: 05/Aug/19 05:37
Worklog Time Spent: 10m 
  Work Description: y1chi commented on pull request #9233:  [BEAM-7866] Fix 
python ReadFromMongoDB potential data loss issue
URL: https://github.com/apache/beam/pull/9233#discussion_r310443577
 
 

 ##
 File path: sdks/python/apache_beam/io/mongodbio_test.py
 ##
 @@ -30,38 +32,102 @@
 from apache_beam.io.mongodbio import _BoundedMongoSource
 from apache_beam.io.mongodbio import _GenerateObjectIdFn
 from apache_beam.io.mongodbio import _MongoSink
+from apache_beam.io.mongodbio import _ObjectIdRangeTracker
 from apache_beam.io.mongodbio import _WriteMongoFn
 from apache_beam.testing.test_pipeline import TestPipeline
 from apache_beam.testing.util import assert_that
 from apache_beam.testing.util import equal_to
 
 
+class _MockMongoColl(object):
+  def __init__(self, docs):
+    self.docs = docs
+
+  def _filter(self, filter):
+    match = []
+    if not filter:
+      return self
+    start = filter['_id'].get('$gte')
+    end = filter['_id'].get('$lt')
+    for doc in self.docs:
+      if start and doc['_id'] < start:
+        continue
+      if end and doc['_id'] >= end:
+        continue
+      match.append(doc)
+    return match
+
+  def find(self, filter=None, **kwargs):
+    return _MockMongoColl(self._filter(filter))
+
+  def sort(self, sort_items):
+    key, order = sort_items[0]
+    self.docs = sorted(self.docs,
+                       key=lambda x: x[key],
+                       reverse=(order != ASCENDING))
+    return self
+
+  def limit(self, num):
+    return _MockMongoColl(self.docs[0:num])
+
+  def count_documents(self, filter):
+    return len(self._filter(filter))
+
+  def __getitem__(self, item):
+    return self.docs[item]
+
 class MongoSourceTest(unittest.TestCase):
-  @mock.patch('apache_beam.io.mongodbio._BoundedMongoSource'
-              '._get_document_count')
-  @mock.patch('apache_beam.io.mongodbio._BoundedMongoSource'
-              '._get_avg_document_size')
-  def setUp(self, mock_size, mock_count):
-    mock_size.return_value = 10
-    mock_count.return_value = 5
+  @mock.patch('apache_beam.io.mongodbio.MongoClient')
+  def setUp(self, mock_client):
+    mock_client.return_value.__enter__.return_value.__getitem__ \
 
 Review comment:
  Makes sense.
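For readers of the archive, a small self-contained sketch of how that patched call chain lines up with the real usage (`with MongoClient(uri) as client: client[db].command(...)`); the URI and names are the test's own:

```python
from unittest import mock

MongoClient = mock.MagicMock()
MongoClient.return_value.__enter__.return_value.__getitem__ \
    .return_value.command.return_value = {'size': 5, 'avgSize': 1}

# Each attribute in the chain mirrors one step of the real call:
#   MongoClient(uri)               -> return_value
#   with ... as client:            -> __enter__.return_value
#   client['testdb']               -> __getitem__.return_value
#   client['testdb'].command(...)  -> command.return_value
with MongoClient('mongodb://test') as client:
  stats = client['testdb'].command('collstats', 'testcoll')
assert stats == {'size': 5, 'avgSize': 1}
```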
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 288684)
Time Spent: 3h  (was: 2h 50m)

> Python MongoDB IO performance and correctness issues
> 
>
> Key: BEAM-7866
> URL: https://issues.apache.org/jira/browse/BEAM-7866
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Eugene Kirpichov
>Assignee: Yichi Zhang
>Priority: Blocker
> Fix For: 2.15.0
>
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/mongodbio.py
>  splits the query result by computing number of results in constructor, and 
> then in each reader re-executing the whole query and getting an index 
> sub-range of those results.
> This is broken in several critical ways:
> - The order of query results returned by find() is not necessarily 
> deterministic, so the idea of index ranges on it is meaningless: each shard 
> may basically get random, possibly overlapping subsets of the total results
> - Even if you add order by `_id`, the database may be changing concurrently 
> with reading and splitting. E.g. if the database contained documents with ids 
> 10 20 30 40 50, and this was split into shards 0..2 and 3..5 (under the 
> assumption that these shards would contain respectively 10 20 30, and 40 50), 
> and then suppose shard 10 20 30 is read and then document 25 is inserted - 
> then the 3..5 shard will read 30 40 50, i.e. document 30 is duplicated and 
> document 25 is lost.
> - Every shard re-executes the query and skips the first start_offset items, 
> which in total is quadratic complexity
> - The query is first executed in the constructor in order to count results, 
> which 1) means the constructor can be super slow and 2) it won't work at all 
> if the database is unavailable at the time the pipeline is constructed (e.g. 
> if this is a template).
> 

[jira] [Work logged] (BEAM-7866) Python MongoDB IO performance and correctness issues

2019-08-04 Thread ASF GitHub Bot (JIRA)


 [ https://issues.apache.org/jira/browse/BEAM-7866?focusedWorklogId=288682&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288682 ]

ASF GitHub Bot logged work on BEAM-7866:


Author: ASF GitHub Bot
Created on: 05/Aug/19 05:36
Start Date: 05/Aug/19 05:36
Worklog Time Spent: 10m 
  Work Description: y1chi commented on pull request #9233:  [BEAM-7866] Fix 
python ReadFromMongoDB potential data loss issue
URL: https://github.com/apache/beam/pull/9233#discussion_r310443397
 
 

 ##
 File path: sdks/python/apache_beam/io/mongodbio.py
 ##
 @@ -194,18 +225,72 @@ def display_data(self):
     res['mongo_client_spec'] = self.spec
     return res
 
-  def _get_avg_document_size(self):
+  def _get_split_keys(self, desired_chunk_size, start_pos, end_pos):
+    # if desired chunk size smaller than 1mb, use mongodb default split size of
+    # 1mb
+    if desired_chunk_size < 1:
 
 Review comment:
  Yes, in MB. Renamed the argument for clarity.
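A sketch, under assumptions from this thread, of how `_get_split_keys` might drive MongoDB's `splitVector` admin command; the exact call shape in the PR may differ:

```python
from pymongo import MongoClient

def _get_split_keys(uri, db, coll, desired_chunk_size_in_mb, start_pos, end_pos):
  # The server rejects maxChunkSize < 1, so clamp to MongoDB's 1 MB minimum.
  if desired_chunk_size_in_mb < 1:
    desired_chunk_size_in_mb = 1
  with MongoClient(uri) as client:
    return client[db].command(
        'splitVector',
        '%s.%s' % (db, coll),
        keyPattern={'_id': 1},          # split along the _id index
        min={'_id': start_pos},
        max={'_id': end_pos},
        maxChunkSize=desired_chunk_size_in_mb)['splitKeys']
```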
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 288682)
Time Spent: 2h 40m  (was: 2.5h)

> Python MongoDB IO performance and correctness issues
> 
>
> Key: BEAM-7866
> URL: https://issues.apache.org/jira/browse/BEAM-7866
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Eugene Kirpichov
>Assignee: Yichi Zhang
>Priority: Blocker
> Fix For: 2.15.0
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/mongodbio.py
>  splits the query result by computing number of results in constructor, and 
> then in each reader re-executing the whole query and getting an index 
> sub-range of those results.
> This is broken in several critical ways:
> - The order of query results returned by find() is not necessarily 
> deterministic, so the idea of index ranges on it is meaningless: each shard 
> may basically get random, possibly overlapping subsets of the total results
> - Even if you add order by `_id`, the database may be changing concurrently 
> with reading and splitting. E.g. if the database contained documents with ids 
> 10 20 30 40 50, and this was split into shards 0..2 and 3..5 (under the 
> assumption that these shards would contain respectively 10 20 30, and 40 50), 
> and then suppose shard 10 20 30 is read and then document 25 is inserted - 
> then the 3..5 shard will read 30 40 50, i.e. document 30 is duplicated and 
> document 25 is lost.
> - Every shard re-executes the query and skips the first start_offset items, 
> which in total is quadratic complexity
> - The query is first executed in the constructor in order to count results, 
> which 1) means the constructor can be super slow and 2) it won't work at all 
> if the database is unavailable at the time the pipeline is constructed (e.g. 
> if this is a template).
> Unfortunately, none of these issues are caught by SourceTestUtils: this class 
> has extensive coverage with it, and the tests pass. This is because the tests 
> return the same results in the same order. I don't know how to catch this 
> automatically, and I don't know how to catch the performance issue 
> automatically, but these would all be important follow-up items after the 
> actual fix.
> CC: [~chamikara] as reviewer.
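To make the failure mode concrete, here is a tiny self-contained illustration (plain integers stand in for ObjectIds; this is not Beam code):

```python
# Reproduces the failure described above: offset-based splitting over a
# collection that changes between reads.
docs = [10, 20, 30, 40, 50]

shard_a = sorted(docs)[0:3]      # planned indices 0..2 -> reads 10, 20, 30
docs.append(25)                  # concurrent insert after shard_a was read
shard_b = sorted(docs)[3:6]      # planned indices 3..5 -> reads 30, 40, 50
assert shard_a + shard_b == [10, 20, 30, 30, 40, 50]  # 30 duplicated, 25 lost

# _id-range splitting is stable: each shard owns a half-open key range, so a
# concurrent insert lands in exactly one range and nothing is double-read.
range_a = [d for d in docs if 0 <= d < 30]    # 10, 20, 25
range_b = [d for d in docs if 30 <= d < 60]   # 30, 40, 50
assert sorted(range_a + range_b) == sorted(docs)
```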



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7866) Python MongoDB IO performance and correctness issues

2019-08-04 Thread ASF GitHub Bot (JIRA)


 [ https://issues.apache.org/jira/browse/BEAM-7866?focusedWorklogId=288681&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288681 ]

ASF GitHub Bot logged work on BEAM-7866:


Author: ASF GitHub Bot
Created on: 05/Aug/19 05:35
Start Date: 05/Aug/19 05:35
Worklog Time Spent: 10m 
  Work Description: y1chi commented on pull request #9233:  [BEAM-7866] Fix 
python ReadFromMongoDB potential data loss issue
URL: https://github.com/apache/beam/pull/9233#discussion_r310443330
 
 

 ##
 File path: sdks/python/apache_beam/io/mongodbio.py
 ##
 @@ -139,50 +143,77 @@ def __init__(self,
     self.filter = filter
     self.projection = projection
     self.spec = extra_client_params
-    self.doc_count = self._get_document_count()
-    self.avg_doc_size = self._get_avg_document_size()
-    self.client = None
 
   def estimate_size(self):
-    return self.avg_doc_size * self.doc_count
+    with MongoClient(self.uri, **self.spec) as client:
+      size = client[self.db].command('collstats', self.coll).get('size')
+      if size is None or size <= 0:
+        raise ValueError('Collection %s not found or total doc size is '
+                         'incorrect' % self.coll)
+      return size
 
   def split(self, desired_bundle_size, start_position=None,
             stop_position=None):
     # use document cursor index as the start and stop positions
     if start_position is None:
-      start_position = 0
+      epoch = datetime.datetime(1970, 1, 1)
+      start_position = objectid.ObjectId.from_datetime(epoch)
     if stop_position is None:
-      stop_position = self.doc_count
+      last_doc_id = self._get_last_document_id()
+      # add one sec to make sure the last document is not excluded
+      last_timestamp_plus_one_sec = (last_doc_id.generation_time +
+                                     datetime.timedelta(seconds=1))
+      stop_position = objectid.ObjectId.from_datetime(
+          last_timestamp_plus_one_sec)
 
-    # get an estimate on how many documents should be included in a split batch
-    desired_bundle_count = desired_bundle_size // self.avg_doc_size
+    desired_bundle_size_in_mb = desired_bundle_size // 1024 // 1024
+    split_keys = self._get_split_keys(desired_bundle_size_in_mb,
+                                      start_position, stop_position)
 
     bundle_start = start_position
-    while bundle_start < stop_position:
-      bundle_end = min(stop_position, bundle_start + desired_bundle_count)
-      yield iobase.SourceBundle(weight=bundle_end - bundle_start,
+    for split_key_id in split_keys:
+      bundle_end = min(stop_position, split_key_id)
+      if bundle_start is None and bundle_start < stop_position:
+        return
+      yield iobase.SourceBundle(weight=desired_bundle_size_in_mb,
                                 source=self,
                                 start_position=bundle_start,
                                 stop_position=bundle_end)
       bundle_start = bundle_end
+    # add range of last split_key to stop_position
+    if bundle_start < stop_position:
+      yield iobase.SourceBundle(weight=desired_bundle_size_in_mb,
+                                source=self,
+                                start_position=bundle_start,
+                                stop_position=stop_position)
 
   def get_range_tracker(self, start_position, stop_position):
     if start_position is None:
-      start_position = 0
+      epoch = datetime.datetime(1970, 1, 1)
+      start_position = objectid.ObjectId.from_datetime(epoch)
     if stop_position is None:
-      stop_position = self.doc_count
-    return OffsetRangeTracker(start_position, stop_position)
+      last_doc_id = self._get_last_document_id()
+      # add one sec to make sure the last document is not excluded
+      last_timestamp_plus_one_sec = (last_doc_id.generation_time +
+                                     datetime.timedelta(seconds=1))
+      stop_position = objectid.ObjectId.from_datetime(
+          last_timestamp_plus_one_sec)
+    return _ObjectIdRangeTracker(start_position, stop_position)
 
   def read(self, range_tracker):
     with MongoClient(self.uri, **self.spec) as client:
-      # docs is a MongoDB Cursor
-      docs = client[self.db][self.coll].find(
-          filter=self.filter, projection=self.projection
-      )[range_tracker.start_position():range_tracker.stop_position()]
-      for index in range(range_tracker.start_position(),
-                         range_tracker.stop_position()):
-        if not range_tracker.try_claim(index):
+      all_filters = self.filter
+      all_filters.update({
+          '_id': {
+              '$gte': range_tracker.start_position(),
 
 Review comment:
   Added the filter merge logic.
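A minimal sketch of what such a merge can look like (assumed shape, not necessarily the PR's exact code), so the user's filter and the tracker's `_id` bounds both apply:

```python
def merged_filter(user_filter, start_id, stop_id):
  all_filters = dict(user_filter or {})   # copy to avoid mutating user input
  id_filter = {'$gte': start_id, '$lt': stop_id}
  if '_id' in all_filters:
    # Keep both conditions when the user also filtered on _id.
    return {'$and': [all_filters, {'_id': id_filter}]}
  all_filters['_id'] = id_filter
  return all_filters
```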
 

This is an automated message from the Apache Git Service.
To respond to the message, 

[jira] [Resolved] (BEAM-7141) Expose kv and window parameters for on_timer

2019-08-04 Thread Thomas Weise (JIRA)


 [ https://issues.apache.org/jira/browse/BEAM-7141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Weise resolved BEAM-7141.

   Resolution: Fixed
Fix Version/s: 2.14.0

> Expose kv and window parameters for on_timer
> 
>
> Key: BEAM-7141
> URL: https://issues.apache.org/jira/browse/BEAM-7141
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-py-core
>Affects Versions: 2.12.0
>Reporter: Thomas Weise
>Assignee: Rakesh Kumar
>Priority: Major
> Fix For: 2.14.0
>
>  Time Spent: 6h 20m
>  Remaining Estimate: 0h
>
> We would like to have access to the key and window inside the timer callback. 
> Without them, it is also difficult to debug. We ran into this while working on 
> BEAM-7112.
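A minimal sketch of what the exposed parameters enable in the Beam Python SDK (illustrative DoFn; keyed input is required for timers):

```python
import apache_beam as beam
from apache_beam.transforms import userstate

class TimedDoFn(beam.DoFn):
  EXPIRY_TIMER = userstate.TimerSpec('expiry', userstate.TimeDomain.WATERMARK)

  def process(self, element, timer=beam.DoFn.TimerParam(EXPIRY_TIMER)):
    timer.set(0)  # fire at watermark timestamp 0

  @userstate.on_timer(EXPIRY_TIMER)
  def expiry_callback(self,
                      key=beam.DoFn.KeyParam,
                      window=beam.DoFn.WindowParam):
    # The firing key and window are now available for output or debugging.
    yield (key, window)
```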



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (BEAM-7074) FnApiRunner fails to wire multiple timer specs in single pardo

2019-08-04 Thread Rakesh Kumar (JIRA)


[ https://issues.apache.org/jira/browse/BEAM-7074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16899738#comment-16899738 ]

Rakesh Kumar commented on BEAM-7074:


I also ran into the same problem when I was writing test cases for {{SetState}}.

> FnApiRunner fails to wire multiple timer specs in single pardo
> --
>
> Key: BEAM-7074
> URL: https://issues.apache.org/jira/browse/BEAM-7074
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-harness
>Affects Versions: 2.11.0
>Reporter: Thomas Weise
>Priority: Major
>
> Multiple timer specs in a ParDo yield "NotImplementedError: Timers and side 
> inputs."
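A minimal repro sketch of the shape that triggers this (illustrative Beam Python DoFn with two TimerSpecs; keyed input assumed):

```python
import apache_beam as beam
from apache_beam.transforms import userstate

class TwoTimersDoFn(beam.DoFn):
  # Declaring two TimerSpecs in one DoFn is what trips the FnApiRunner.
  TIMER_A = userstate.TimerSpec('a', userstate.TimeDomain.WATERMARK)
  TIMER_B = userstate.TimerSpec('b', userstate.TimeDomain.WATERMARK)

  def process(self, element,
              timer_a=beam.DoFn.TimerParam(TIMER_A),
              timer_b=beam.DoFn.TimerParam(TIMER_B)):
    timer_a.set(10)
    timer_b.set(20)

  @userstate.on_timer(TIMER_A)
  def fire_a(self):
    yield 'a'

  @userstate.on_timer(TIMER_B)
  def fire_b(self):
    yield 'b'
```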



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7874) FnApi only supports up to 10 workers

2019-08-04 Thread ASF GitHub Bot (JIRA)


 [ https://issues.apache.org/jira/browse/BEAM-7874?focusedWorklogId=288657&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288657 ]

ASF GitHub Bot logged work on BEAM-7874:


Author: ASF GitHub Bot
Created on: 05/Aug/19 02:36
Start Date: 05/Aug/19 02:36
Worklog Time Spent: 10m 
  Work Description: Hannah-Jiang commented on pull request #9218: 
[BEAM-7874], [BEAM-7873] FnApi bugfixs
URL: https://github.com/apache/beam/pull/9218#discussion_r310422730
 
 

 ##
 File path: sdks/python/apache_beam/runners/portability/fn_api_runner.py
 ##
 @@ -1319,55 +1319,62 @@ def stop_worker(self):
 @WorkerHandler.register_environment(common_urns.environments.DOCKER.urn,
                                     beam_runner_api_pb2.DockerPayload)
 class DockerSdkWorkerHandler(GrpcWorkerHandler):
+
+  _lock = threading.Lock()
+
   def __init__(self, payload, state, provision_info, grpc_server):
     super(DockerSdkWorkerHandler, self).__init__(state, provision_info,
                                                  grpc_server)
     self._container_image = payload.container_image
     self._container_id = None
 
   def start_worker(self):
-    try:
-      subprocess.check_call(['docker', 'pull', self._container_image])
-    except Exception:
-      logging.info('Unable to pull image %s' % self._container_image)
-    self._container_id = subprocess.check_output(
-        ['docker',
-         'run',
-         '-d',
-         # TODO:  credentials
-         '--network=host',
-         self._container_image,
-         '--id=%s' % uuid.uuid4(),
-         '--logging_endpoint=%s' % self.logging_api_service_descriptor().url,
-         '--control_endpoint=%s' % self.control_address,
-         '--artifact_endpoint=%s' % self.control_address,
-         '--provision_endpoint=%s' % self.control_address,
-        ]).strip()
-    while True:
-      logging.info('Waiting for docker to start up...')
-      status = subprocess.check_output([
-          'docker',
-          'inspect',
-          '-f',
-          '{{.State.Status}}',
-          self._container_id]).strip()
-      if status == 'running':
-        break
-      elif status in ('dead', 'exited'):
-        subprocess.call([
+    with DockerSdkWorkerHandler._lock:
+      try:
+        subprocess.check_call(['docker', 'pull', self._container_image])
+      except Exception:
+        logging.info('Unable to pull image %s' % self._container_image)
+      self._container_id = subprocess.check_output(
 
 Review comment:
  I just noticed that DockerSdkWorkerHandler can be improved. It calls 
`apache_beam.runners.worker.sdk_worker_main` at bootstrap, so rather than 
creating multiple containers, we only need to pass the process count to 
sdk_worker_main. I am going to work on customized containers in Q3 and will 
improve this as part of the customized container project.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 288657)
Time Spent: 1h 10m  (was: 1h)

> FnApi only supports up to 10 workers
> 
>
> Key: BEAM-7874
> URL: https://issues.apache.org/jira/browse/BEAM-7874
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Hannah Jiang
>Assignee: Hannah Jiang
>Priority: Blocker
> Fix For: 2.15.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Because max_workers of the grpc servers is hardcoded to 10, only up to 10 
> workers are supported; if we pass a direct_num_workers greater than 10, the 
> pipeline hangs because not all workers get connected to the runner.
> [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/portability/fn_api_runner.py#L1141]
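The fix direction this implies, as a minimal sketch (standard grpcio API; the function name is illustrative):

```python
from concurrent import futures
import grpc

def make_server(num_workers):
  # Size the thread pool from the requested worker count instead of a
  # hardcoded 10, so every SDK worker can hold a connection to the runner.
  return grpc.server(futures.ThreadPoolExecutor(max_workers=num_workers))
```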



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7060) Design Py3-compatible typehints annotation support in Beam 3.

2019-08-04 Thread ASF GitHub Bot (JIRA)


 [ https://issues.apache.org/jira/browse/BEAM-7060?focusedWorklogId=288646&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288646 ]

ASF GitHub Bot logged work on BEAM-7060:


Author: ASF GitHub Bot
Created on: 05/Aug/19 00:34
Start Date: 05/Aug/19 00:34
Worklog Time Spent: 10m 
  Work Description: amaliujia commented on issue #9223: [BEAM-7060] 
Introduce Python3-only test modules
URL: https://github.com/apache/beam/pull/9223#issuecomment-518049702
 
 
   Hello!
   
   I am wondering whether this PR causes `./gradlew spotlessApply` to fail on 
the master branch.
   
   Running `./gradlew spotlessApply` gives the following error message on the 
master branch:
   
   ```
   * Where:
   Build file 
'/Users/ruwang/Documents/beam/sdks/python/apache_beam/testing/load_tests/build.gradle'
 line: 44
   
   * What went wrong:
   A problem occurred evaluating project 
':sdks:python:apache_beam:testing:load_tests'.
   > Could not get unknown property 'files' for task 
':sdks:python:apache_beam:testing:load_tests:run' of type 
org.gradle.api.tasks.Exec.
   
   * Try:
   Run with --stacktrace option to get the stack trace. Run with --info or 
--debug option to get more log output. Run with --scan to get full insights.
   ```
   
   
   I am working on a MacBook.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 288646)
Time Spent: 9.5h  (was: 9h 20m)

> Design Py3-compatible typehints annotation support in Beam 3.
> -
>
> Key: BEAM-7060
> URL: https://issues.apache.org/jira/browse/BEAM-7060
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-py-core
>Reporter: Valentyn Tymofieiev
>Assignee: Udi Meiri
>Priority: Major
>  Time Spent: 9.5h
>  Remaining Estimate: 0h
>
> The existing [typehints implementation in 
> Beam|https://github.com/apache/beam/blob/master/sdks/python/apache_beam/typehints/] 
> heavily relies on internal details of the CPython implementation, and some of 
> the assumptions of this implementation broke as of Python 3.6; see for 
> example https://issues.apache.org/jira/browse/BEAM-6877, which makes 
> typehints support unusable on Python 3.6 as of now. The [Python 3 Kanban 
> Board|https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=245&view=detail] 
> lists several specific typehints-related breakages, prefixed with "TypeHints 
> Py3 Error".
> We need to decide whether to:
> - Deprecate in-house typehints implementation.
> - Continue to support in-house implementation, which at this point is a stale 
> code and has other known issues.
> - Attempt to use some off-the-shelf libraries for supporting 
> type-annotations, like  Pytype, Mypy, PyAnnotate.
> WRT this decision we also need to plan immediate next steps to unblock 
> adoption of Beam for Python 3.6+ users. One potential option may be to have 
> the Beam SDK ignore any typehint annotations on Py 3.6+.
> cc: [~udim], [~altay], [~robertwb].



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7060) Design Py3-compatible typehints annotation support in Beam 3.

2019-08-04 Thread ASF GitHub Bot (JIRA)


 [ https://issues.apache.org/jira/browse/BEAM-7060?focusedWorklogId=288645&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288645 ]

ASF GitHub Bot logged work on BEAM-7060:


Author: ASF GitHub Bot
Created on: 05/Aug/19 00:33
Start Date: 05/Aug/19 00:33
Worklog Time Spent: 10m 
  Work Description: amaliujia commented on issue #9223: [BEAM-7060] 
Introduce Python3-only test modules
URL: https://github.com/apache/beam/pull/9223#issuecomment-518049702
 
 
   Hello!
   
   I am wondering whether this PR causes `./gradlew spotlessApply` to fail on 
the master branch.
   
   Running `./gradlew spotlessApply` gives the following error message on the 
master branch:
   
   ```
   * Where:
   Build file 
'/Users/ruwang/Documents/beam/sdks/python/apache_beam/testing/load_tests/build.gradle'
 line: 44
   
   * What went wrong:
   A problem occurred evaluating project 
':sdks:python:apache_beam:testing:load_tests'.
   > Could not get unknown property 'files' for task 
':sdks:python:apache_beam:testing:load_tests:run' of type 
org.gradle.api.tasks.Exec.
   
   * Try:
   Run with --stacktrace option to get the stack trace. Run with --info or 
--debug option to get more log output. Run with --scan to get full insights.
   ```
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 288645)
Time Spent: 9h 20m  (was: 9h 10m)

> Design Py3-compatible typehints annotation support in Beam 3.
> -
>
> Key: BEAM-7060
> URL: https://issues.apache.org/jira/browse/BEAM-7060
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-py-core
>Reporter: Valentyn Tymofieiev
>Assignee: Udi Meiri
>Priority: Major
>  Time Spent: 9h 20m
>  Remaining Estimate: 0h
>
> The existing [typehints implementation in 
> Beam|https://github.com/apache/beam/blob/master/sdks/python/apache_beam/typehints/] 
> heavily relies on internal details of the CPython implementation, and some of 
> the assumptions of this implementation broke as of Python 3.6; see for 
> example https://issues.apache.org/jira/browse/BEAM-6877, which makes 
> typehints support unusable on Python 3.6 as of now. The [Python 3 Kanban 
> Board|https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=245&view=detail] 
> lists several specific typehints-related breakages, prefixed with "TypeHints 
> Py3 Error".
> We need to decide whether to:
> - Deprecate in-house typehints implementation.
> - Continue to support in-house implementation, which at this point is a stale 
> code and has other known issues.
> - Attempt to use some off-the-shelf libraries for supporting 
> type-annotations, like  Pytype, Mypy, PyAnnotate.
> WRT this decision we also need to plan immediate next steps to unblock 
> adoption of Beam for Python 3.6+ users. One potential option may be to have 
> the Beam SDK ignore any typehint annotations on Py 3.6+.
> cc: [~udim], [~altay], [~robertwb].



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-08-04 Thread ASF GitHub Bot (JIRA)


 [ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=288641&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288641 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 05/Aug/19 00:21
Start Date: 05/Aug/19 00:21
Worklog Time Spent: 10m 
  Work Description: boyuanzz commented on pull request #9144: [BEAM-7013] 
Integrating ZetaSketch's HLL++ algorithm with Beam
URL: https://github.com/apache/beam/pull/9144#discussion_r310411036
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCount.java
 ##
 @@ -0,0 +1,345 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.beam.sdk.extensions.zetasketch;
+
+import com.google.zetasketch.HyperLogLogPlusPlus;
+import org.apache.beam.sdk.annotations.Experimental;
+import org.apache.beam.sdk.transforms.Combine;
+import org.apache.beam.sdk.transforms.DoFn;
+import org.apache.beam.sdk.transforms.PTransform;
+import org.apache.beam.sdk.transforms.ParDo;
+import org.apache.beam.sdk.values.KV;
+import org.apache.beam.sdk.values.PCollection;
+
+/**
+ * {@code PTransform}s to compute HyperLogLogPlusPlus (HLL++) sketches on data streams based on the
+ * <a href="https://github.com/google/zetasketch">ZetaSketch</a> implementation.
+ *
+ * <p>HLL++ is an algorithm implemented by Google that estimates the count of distinct elements in a
+ * data stream. HLL++ requires significantly less memory than the linear memory needed for exact
+ * computation, at the cost of a small error. Cardinalities of arbitrary breakdowns can be computed
+ * using the HLL++ sketch. See this <a
+ * href="http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/40671.pdf">published
+ * paper</a> for details about the algorithm.
+ *
+ * <p>HLL++ functions are also supported in <a
+ * href="https://cloud.google.com/bigquery/docs/reference/standard-sql/hll_functions">Google Cloud
+ * BigQuery</a>. Using the {@code HllCount PTransform}s makes the interoperation with BigQuery
+ * easier.
+ *
+ * <h3>Examples</h3>
+ *
+ * <p>Example 1: Create long-type sketch for a {@code PCollection<Long>} and specify precision
+ *
+ * <pre>{@code
+ * PCollection<Long> input = ...;
+ * int p = ...;
+ * PCollection<byte[]> sketch = input.apply(HllCount.Init.longSketch().withPrecision(p).globally());
+ * }</pre>
+ *
+ * <p>Example 2: Create bytes-type sketch for a {@code PCollection<KV<String, byte[]>>}
+ *
+ * <pre>{@code
+ * PCollection<KV<String, byte[]>> input = ...;
+ * PCollection<KV<String, byte[]>> sketch = input.apply(HllCount.Init.bytesSketch().perKey());
+ * }</pre>
+ *
+ * <p>Example 3: Merge existing sketches in a {@code PCollection<byte[]>} into a new one
+ *
+ * <pre>{@code
+ * PCollection<byte[]> sketches = ...;
+ * PCollection<byte[]> mergedSketch = sketches.apply(HllCount.MergePartial.globally());
+ * }</pre>
+ *
+ * <p>Example 4: Estimates the count of distinct elements in a {@code PCollection<String>}
+ *
+ * <pre>{@code
+ * PCollection<String> input = ...;
+ * PCollection<Long> countDistinct =
+ *     input.apply(HllCount.Init.stringSketch().globally()).apply(HllCount.Extract.globally());
+ * }</pre>
+ */
+@Experimental
+public final class HllCount {
+
+  public static final int MINIMUM_PRECISION = HyperLogLogPlusPlus.MINIMUM_PRECISION;
+  public static final int MAXIMUM_PRECISION = HyperLogLogPlusPlus.MAXIMUM_PRECISION;
+  public static final int DEFAULT_PRECISION = HyperLogLogPlusPlus.DEFAULT_NORMAL_PRECISION;
+
+  // Cannot be instantiated. This class is intended to be a namespace only.
+  private HllCount() {}
+
+  /**
+   * Provide {@code PTransform}s to aggregate inputs into HLL++ sketches. The four supported input
+   * types are {@code Integer}, {@code Long}, {@code String}, and {@code byte[]}.
+   *
+   * <p>Sketches are represented using the {@code byte[]} type. Sketches of the same type and {@code
+   * precision} can be merged into a new sketch using {@link HllCount.MergePartial}. Estimated count
+   * of distinct elements can be extracted from sketches using {@link HllCount.Extract}.
+   *
+   * <p>Correspond to the {@code HLL_COUNT.INIT(input [, precision])} function in <a
+   * href="https://cloud.google.com/bigquery/docs/reference/standard-sql/hll_functions">BigQuery</a>.
+   */
+  public static 

[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-08-04 Thread ASF GitHub Bot (JIRA)


 [ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=288640&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288640 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 05/Aug/19 00:19
Start Date: 05/Aug/19 00:19
Worklog Time Spent: 10m 
  Work Description: boyuanzz commented on pull request #9144: [BEAM-7013] 
Integrating ZetaSketch's HLL++ algorithm with Beam
URL: https://github.com/apache/beam/pull/9144#discussion_r310410938
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/HllCountTest.java
 ##
 @@ -0,0 +1,373 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.beam.sdk.extensions.zetasketch;
+
+import com.google.zetasketch.HyperLogLogPlusPlus;
+import com.google.zetasketch.shaded.com.google.protobuf.ByteString;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+import org.apache.beam.sdk.Pipeline.PipelineExecutionException;
+import org.apache.beam.sdk.testing.NeedsRunner;
+import org.apache.beam.sdk.testing.PAssert;
+import org.apache.beam.sdk.testing.TestPipeline;
+import org.apache.beam.sdk.transforms.Create;
+import org.apache.beam.sdk.values.KV;
+import org.apache.beam.sdk.values.PCollection;
+import org.apache.beam.sdk.values.TypeDescriptor;
+import org.junit.Rule;
+import org.junit.Test;
+import org.junit.experimental.categories.Category;
+import org.junit.rules.ExpectedException;
+import org.junit.runner.RunWith;
+import org.junit.runners.JUnit4;
+
+/** Tests for {@link HllCount}. */
+@RunWith(JUnit4.class)
+public class HllCountTest {
+
+  @Rule public final transient TestPipeline p = TestPipeline.create();
+  @Rule public transient ExpectedException thrown = ExpectedException.none();
+
+  // Integer
+  private static final List<Integer> INTS1 = Arrays.asList(1, 2, 3, 3, 1, 4);
+  private static final byte[] INTS1_SKETCH;
+  private static final Long INTS1_ESTIMATE;
+
+  static {
+    HyperLogLogPlusPlus<Integer> hll = new HyperLogLogPlusPlus.Builder().buildForIntegers();
+    INTS1.forEach(hll::add);
+    INTS1_SKETCH = hll.serializeToByteArray();
+    INTS1_ESTIMATE = hll.longResult();
+  }
+
+  private static final List<Integer> INTS2 = Arrays.asList(3, 3, 3, 3);
+  private static final byte[] INTS2_SKETCH;
+  private static final Long INTS2_ESTIMATE;
+
+  static {
+    HyperLogLogPlusPlus<Integer> hll = new HyperLogLogPlusPlus.Builder().buildForIntegers();
+    INTS2.forEach(hll::add);
+    INTS2_SKETCH = hll.serializeToByteArray();
+    INTS2_ESTIMATE = hll.longResult();
+  }
+
+  private static final byte[] INTS1_INTS2_SKETCH;
+
+  static {
+    HyperLogLogPlusPlus<?> hll = HyperLogLogPlusPlus.forProto(INTS1_SKETCH);
+    hll.merge(INTS2_SKETCH);
+    INTS1_INTS2_SKETCH = hll.serializeToByteArray();
+  }
+
+  // Long
+  private static final List<Long> LONGS = Collections.singletonList(1L);
+  private static final byte[] LONGS_SKETCH;
+
+  static {
+    HyperLogLogPlusPlus<Long> hll = new HyperLogLogPlusPlus.Builder().buildForLongs();
+    LONGS.forEach(hll::add);
+    LONGS_SKETCH = hll.serializeToByteArray();
+  }
+
+  private static final byte[] LONGS_EMPTY_SKETCH;
+
+  static {
+    HyperLogLogPlusPlus<Long> hll = new HyperLogLogPlusPlus.Builder().buildForLongs();
+    LONGS_EMPTY_SKETCH = hll.serializeToByteArray();
+  }
+
+  // String
+  private static final List<String> STRINGS = Arrays.asList("s1", "s2", "s1", "s2");
+  private static final byte[] STRINGS_SKETCH;
+
+  static {
+    HyperLogLogPlusPlus<String> hll = new HyperLogLogPlusPlus.Builder().buildForStrings();
+    STRINGS.forEach(hll::add);
+    STRINGS_SKETCH = hll.serializeToByteArray();
+  }
+
+  private static final int TEST_PRECISION = 20;
+  private static final byte[] STRINGS_SKETCH_TEST_PRECISION;
+
+  static {
+    HyperLogLogPlusPlus<String> hll =
+        new HyperLogLogPlusPlus.Builder().normalPrecision(TEST_PRECISION).buildForStrings();
+    STRINGS.forEach(hll::add);
+    STRINGS_SKETCH_TEST_PRECISION = hll.serializeToByteArray();
+  }
+
+  // Bytes
+  private static final byte[] BYTES0 = {(byte) 0x1, (byte) 

[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-08-04 Thread ASF GitHub Bot (JIRA)


 [ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=288639&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288639 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 05/Aug/19 00:18
Start Date: 05/Aug/19 00:18
Worklog Time Spent: 10m 
  Work Description: boyuanzz commented on pull request #9144: [BEAM-7013] 
Integrating ZetaSketch's HLL++ algorithm with Beam
URL: https://github.com/apache/beam/pull/9144#discussion_r310410891
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/HllCountTest.java
 ##
 @@ -0,0 +1,373 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.beam.sdk.extensions.zetasketch;
+
+import com.google.zetasketch.HyperLogLogPlusPlus;
+import com.google.zetasketch.shaded.com.google.protobuf.ByteString;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+import org.apache.beam.sdk.Pipeline.PipelineExecutionException;
+import org.apache.beam.sdk.testing.NeedsRunner;
+import org.apache.beam.sdk.testing.PAssert;
+import org.apache.beam.sdk.testing.TestPipeline;
+import org.apache.beam.sdk.transforms.Create;
+import org.apache.beam.sdk.values.KV;
+import org.apache.beam.sdk.values.PCollection;
+import org.apache.beam.sdk.values.TypeDescriptor;
+import org.junit.Rule;
+import org.junit.Test;
+import org.junit.experimental.categories.Category;
+import org.junit.rules.ExpectedException;
+import org.junit.runner.RunWith;
+import org.junit.runners.JUnit4;
+
+/** Tests for {@link HllCount}. */
+@RunWith(JUnit4.class)
+public class HllCountTest {
 
 Review comment:
  > The tests currently included in HllCountTest are intended to be at 
unit-test level.
  
  All test cases in HllCountTest are marked as `NeedsRunner`, which means 
these tests are required to run with a runner. Thus, I don't think these tests 
are at the UT level. And if there is no test target running these tests, I'm 
not sure what the purpose of the tests is.
  
  How to construct tests depends on how you scope your feature. If you think 
your feature is Dataflow-only, then you should run your tests with the 
Dataflow runner. You can try to define a test target in your module's gradle 
file and make it depend on the dataflow-runner project, but I'm not sure 
whether that would introduce circular dependencies.
  
  As I mentioned above, I totally agree it's better for us to have ITs with BQ 
if BQ is the major use case. But I would prefer having tests inside this PR; 
otherwise, it seems we will not have any proof that this transform works, 
right?
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 288639)
Time Spent: 10.5h  (was: 10h 20m)

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
>  Issue Type: New Feature
>  Components: extensions-java-sketching, sdk-java-core
>Reporter: Yueyang Qiu
>Assignee: Yueyang Qiu
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 10.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-08-04 Thread ASF GitHub Bot (JIRA)


 [ https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=288637&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288637 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 04/Aug/19 23:56
Start Date: 04/Aug/19 23:56
Worklog Time Spent: 10m 
  Work Description: boyuanzz commented on pull request #9144: [BEAM-7013] 
Integrating ZetaSketch's HLL++ algorithm with Beam
URL: https://github.com/apache/beam/pull/9144#discussion_r310409829
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCountMergePartialFn.java
 ##
 @@ -0,0 +1,151 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.beam.sdk.extensions.zetasketch;
+
+import com.google.zetasketch.HyperLogLogPlusPlus;
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.OutputStream;
+import javax.annotation.Nullable;
+import org.apache.beam.sdk.coders.AtomicCoder;
+import org.apache.beam.sdk.coders.Coder;
+import org.apache.beam.sdk.coders.CoderRegistry;
+import org.apache.beam.sdk.coders.NullableCoder;
+import org.apache.beam.sdk.extensions.zetasketch.HllCountMergePartialFn.HyperLogLogPlusPlusWrapper;
+import org.apache.beam.sdk.transforms.Combine;
+
+/**
+ * {@link Combine.CombineFn} for the {@link HllCount.MergePartial} combiner.
+ *
+ * @param <HllT> type of the HLL++ sketch to be merged
+ */
+class HllCountMergePartialFn<HllT>
 
 Review comment:
  Seems like connecting to BQ is the major use case of this transform. If 
that's the case, several ITs demonstrating how to use this combiner with BQ 
would be great.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 288637)
Time Spent: 10h 20m  (was: 10h 10m)

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
>  Issue Type: New Feature
>  Components: extensions-java-sketching, sdk-java-core
>Reporter: Yueyang Qiu
>Assignee: Yueyang Qiu
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 10h 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-3489) Expose the message id of received messages within PubsubMessage

2019-08-04 Thread ASF GitHub Bot (JIRA)


 [ https://issues.apache.org/jira/browse/BEAM-3489?focusedWorklogId=288636&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288636 ]

ASF GitHub Bot logged work on BEAM-3489:


Author: ASF GitHub Bot
Created on: 04/Aug/19 23:48
Start Date: 04/Aug/19 23:48
Worklog Time Spent: 10m 
  Work Description: thinhha commented on issue #8370: [BEAM-3489] add 
PubSub messageId in PubsubMessage
URL: https://github.com/apache/beam/pull/8370#issuecomment-518045772
 
 
   Hi @lukecwik, looks like the non-messageId coders are hard-coded in 
`PubsubUnboundedSource`:
   
   Line 1088 of 
`beam/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubUnboundedSource.java`:
   ```
   @Override
   public Coder<PubsubMessage> getOutputCoder() {
     return outer.getNeedsAttributes()
         ? PubsubMessageWithAttributesCoder.of()
         : PubsubMessagePayloadOnlyCoder.of();
   }
   ```
   
   This means that while the PubsubMessage is created with the messageId field 
populated when using 
`PubsubIO.readMessagesWithMessageId()`/`PubsubIO.readMessagesWithAttributesAndMessageId()`
 , this field does not show up in the subsequent step.
   
   Looks like `setNeedsAttributes` is a required field. I'm not sure how to 
change `getOutputCoder()`
without breaking something.
   
   How do you think this case can be handled?
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 288636)
Time Spent: 6h 20m  (was: 6h 10m)

> Expose the message id of received messages within PubsubMessage
> ---
>
> Key: BEAM-3489
> URL: https://issues.apache.org/jira/browse/BEAM-3489
> Project: Beam
>  Issue Type: New Feature
>  Components: io-java-gcp
>Reporter: Luke Cwik
>Assignee: Thinh Ha
>Priority: Minor
>  Labels: newbie, starter
>  Time Spent: 6h 20m
>  Remaining Estimate: 0h
>
> This task is about passing forward the message id from the pubsub proto to 
> the java PubsubMessage.
> Add a message id field to PubsubMessage.
> Update the coder for PubsubMessage to encode the message id.
> Update the translation from the Pubsub proto message to the Dataflow message:
> https://github.com/apache/beam/blob/2e275264b21db45787833502e5e42907b05e28b8/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubUnboundedSource.java#L976



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-3489) Expose the message id of received messages within PubsubMessage

2019-08-04 Thread ASF GitHub Bot (JIRA)


 [ https://issues.apache.org/jira/browse/BEAM-3489?focusedWorklogId=288635&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288635 ]

ASF GitHub Bot logged work on BEAM-3489:


Author: ASF GitHub Bot
Created on: 04/Aug/19 23:46
Start Date: 04/Aug/19 23:46
Worklog Time Spent: 10m 
  Work Description: thinhha commented on issue #8370: [BEAM-3489] add 
PubSub messageId in PubsubMessage
URL: https://github.com/apache/beam/pull/8370#issuecomment-515803816
 
 
   I added some logging to the IT to help debug. Will remove them later. Sorry 
for doing that here as I'm not sure how to run the IT locally.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 288635)
Time Spent: 6h 10m  (was: 6h)

> Expose the message id of received messages within PubsubMessage
> ---
>
> Key: BEAM-3489
> URL: https://issues.apache.org/jira/browse/BEAM-3489
> Project: Beam
>  Issue Type: New Feature
>  Components: io-java-gcp
>Reporter: Luke Cwik
>Assignee: Thinh Ha
>Priority: Minor
>  Labels: newbie, starter
>  Time Spent: 6h 10m
>  Remaining Estimate: 0h
>
> This task is about passing forward the message id from the pubsub proto to 
> the java PubsubMessage.
> Add a message id field to PubsubMessage.
> Update the coder for PubsubMessage to encode the message id.
> Update the translation from the Pubsub proto message to the Dataflow message:
> https://github.com/apache/beam/blob/2e275264b21db45787833502e5e42907b05e28b8/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubUnboundedSource.java#L976



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-3489) Expose the message id of received messages within PubsubMessage

2019-08-04 Thread ASF GitHub Bot (JIRA)


 [ https://issues.apache.org/jira/browse/BEAM-3489?focusedWorklogId=288634&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288634 ]

ASF GitHub Bot logged work on BEAM-3489:


Author: ASF GitHub Bot
Created on: 04/Aug/19 23:43
Start Date: 04/Aug/19 23:43
Worklog Time Spent: 10m 
  Work Description: thinhha commented on issue #8370: [BEAM-3489] add 
PubSub messageId in PubsubMessage
URL: https://github.com/apache/beam/pull/8370#issuecomment-518045772
 
 
   Hi @lukecwik, looks like the non-messageId coders are hard-coded in 
`PubsubUnboundedSource`:
   
   Line 1088 of 
`beam/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubUnboundedSource.java`:
   ```
   @Override
   public Coder<PubsubMessage> getOutputCoder() {
     return outer.getNeedsAttributes()
         ? PubsubMessageWithAttributesCoder.of()
         : PubsubMessagePayloadOnlyCoder.of();
   }
   ```
   
   This means that while the PubsubMessage is created with the messageId field 
populated when using 
`PubsubIO.readMessagesWithMessageId()`/`PubsubIO.readMessagesWithAttributesAndMessageId()`
 , this field does not show up in the subsequent step.
   
   Looks like this is a required field for Autovalue. How do you think this 
case can be handled?
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 288634)
Time Spent: 6h  (was: 5h 50m)

> Expose the message id of received messages within PubsubMessage
> ---
>
> Key: BEAM-3489
> URL: https://issues.apache.org/jira/browse/BEAM-3489
> Project: Beam
>  Issue Type: New Feature
>  Components: io-java-gcp
>Reporter: Luke Cwik
>Assignee: Thinh Ha
>Priority: Minor
>  Labels: newbie, starter
>  Time Spent: 6h
>  Remaining Estimate: 0h
>
> This task is about passing forward the message id from the pubsub proto to 
> the java PubsubMessage.
> Add a message id field to PubsubMessage.
> Update the coder for PubsubMessage to encode the message id.
> Update the translation from the Pubsub proto message to the Dataflow message:
> https://github.com/apache/beam/blob/2e275264b21db45787833502e5e42907b05e28b8/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubUnboundedSource.java#L976



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-3489) Expose the message id of received messages within PubsubMessage

2019-08-04 Thread ASF GitHub Bot (JIRA)


 [ https://issues.apache.org/jira/browse/BEAM-3489?focusedWorklogId=288633&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288633 ]

ASF GitHub Bot logged work on BEAM-3489:


Author: ASF GitHub Bot
Created on: 04/Aug/19 23:42
Start Date: 04/Aug/19 23:42
Worklog Time Spent: 10m 
  Work Description: thinhha commented on issue #8370: [BEAM-3489] add 
PubSub messageId in PubsubMessage
URL: https://github.com/apache/beam/pull/8370#issuecomment-518045772
 
 
   Hi @lukecwik, looks like the non-messageId coders are hard-coded in 
`PubsubUnboundedSource`:
   
   Line 1088 of 
`beam/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubUnboundedSource.java`:
   ```
   @Override
   public Coder<PubsubMessage> getOutputCoder() {
     return outer.getNeedsAttributes()
         ? PubsubMessageWithAttributesCoder.of()
         : PubsubMessagePayloadOnlyCoder.of();
   }
   ```
   
   This means that while the PubsubMessage is created with the messageId field 
using 
`PubsubIO.readMessagesWithMessageId()`/`PubsubIO.readMessagesWithAttributesAndMessageId()`
 , it does not show up in the subsequent step.
   
   Looks like this is a required field for Autovalue. How do you think this 
case can be handled?
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 288633)
Time Spent: 5h 50m  (was: 5h 40m)

> Expose the message id of received messages within PubsubMessage
> ---
>
> Key: BEAM-3489
> URL: https://issues.apache.org/jira/browse/BEAM-3489
> Project: Beam
>  Issue Type: New Feature
>  Components: io-java-gcp
>Reporter: Luke Cwik
>Assignee: Thinh Ha
>Priority: Minor
>  Labels: newbie, starter
>  Time Spent: 5h 50m
>  Remaining Estimate: 0h
>
> This task is about passing forward the message id from the pubsub proto to 
> the java PubsubMessage.
> Add a message id field to PubsubMessage.
> Update the coder for PubsubMessage to encode the message id.
> Update the translation from the Pubsub proto message to the Dataflow message:
> https://github.com/apache/beam/blob/2e275264b21db45787833502e5e42907b05e28b8/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubUnboundedSource.java#L976



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7742) BigQuery File Loads to work well with load job size limits

2019-08-04 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7742?focusedWorklogId=288594&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288594
 ]

ASF GitHub Bot logged work on BEAM-7742:


Author: ASF GitHub Bot
Created on: 04/Aug/19 20:24
Start Date: 04/Aug/19 20:24
Worklog Time Spent: 10m 
  Work Description: ttanay commented on pull request #9242: [BEAM-7742] 
Partition files in BQFL to cater to quotas & limits
URL: https://github.com/apache/beam/pull/9242
 
 
   Partition the files written based on:
   1. Total size of the load job
   2. Max files per load job
   This ensures that all load jobs that are triggered respect these
   limits.
   Temp tables are only used when there are multiple partitions
   per destination. For destinations with a single partition,
   data is loaded directly into the destination table.
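   
   The packing itself is straightforward to state. An illustrative sketch of 
the greedy partitioning in Java (the Beam change is in the Python SDK; the 
limit constants below are example stand-ins for BigQuery's actual load-job 
quotas):
   
   ```
   import java.util.ArrayList;
   import java.util.List;
   
   // Illustrative sketch: greedily pack files into partitions so that no
   // partition exceeds the per-load-job limits on total bytes or file count.
   class FilePartitioner {
     static final long MAX_PARTITION_BYTES = 15L * 1024 * 1024 * 1024 * 1024; // example: 15 TB
     static final int MAX_FILES_PER_PARTITION = 10_000; // example: 10k files
   
     static List<List<FileRef>> partition(List<FileRef> files) {
       List<List<FileRef>> partitions = new ArrayList<>();
       List<FileRef> current = new ArrayList<>();
       long currentBytes = 0;
       for (FileRef f : files) {
         boolean overSize = currentBytes + f.sizeBytes > MAX_PARTITION_BYTES;
         boolean overCount = current.size() + 1 > MAX_FILES_PER_PARTITION;
         if (!current.isEmpty() && (overSize || overCount)) {
           partitions.add(current);
           current = new ArrayList<>();
           currentBytes = 0;
         }
         current.add(f);
         currentBytes += f.sizeBytes;
       }
       if (!current.isEmpty()) {
         partitions.add(current);
       }
       return partitions;
     }
   
     static class FileRef {
       final String path;
       final long sizeBytes;
   
       FileRef(String path, long sizeBytes) {
         this.path = path;
         this.sizeBytes = sizeBytes;
       }
     }
   }
   ```
   
   A destination that ends up with exactly one partition can then load straight 
into the target table, while multi-partition destinations go through temp 
tables plus a copy step.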
   
   
   
   

[jira] [Work logged] (BEAM-7880) Upgrade Jackson databind to version 2.9.9.2

2019-08-04 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7880?focusedWorklogId=288593&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288593
 ]

ASF GitHub Bot logged work on BEAM-7880:


Author: ASF GitHub Bot
Created on: 04/Aug/19 20:23
Start Date: 04/Aug/19 20:23
Worklog Time Spent: 10m 
  Work Description: iemejia commented on issue #9229: [BEAM-7880] Upgrade 
Jackson databind to version 2.9.9.2
URL: https://github.com/apache/beam/pull/9229#issuecomment-518033388
 
 
   Just for info: there is a [regression in the 2.9.9.2 
release](https://github.com/FasterXML/jackson-databind/issues/2395) that is 
biting us on the Spark runner, so we have to wait for a new micro patch 
release. I will update this PR once it is out; it should be happening this week.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 288593)
Time Spent: 2h  (was: 1h 50m)

> Upgrade Jackson databind to version 2.9.9.2
> ---
>
> Key: BEAM-7880
> URL: https://issues.apache.org/jira/browse/BEAM-7880
> Project: Beam
>  Issue Type: Improvement
>  Components: build-system, sdk-java-core
>Reporter: Ismaël Mejía
>Assignee: Ismaël Mejía
>Priority: Blocker
> Fix For: 2.15.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> Jackson databind 2.9.9 and earlier versions have multiple CVEs:
> https://www.cvedetails.com/cve/CVE-2019-12814
> https://www.cvedetails.com/cve/CVE-2019-12384
> https://www.cvedetails.com/cve/CVE-2019-14379
> https://www.cvedetails.com/cve/CVE-2019-14439



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7889) Make release scripts less interactive

2019-08-04 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7889?focusedWorklogId=288566&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-288566
 ]

ASF GitHub Bot logged work on BEAM-7889:


Author: ASF GitHub Bot
Created on: 04/Aug/19 18:38
Start Date: 04/Aug/19 18:38
Worklog Time Spent: 10m 
  Work Description: markflyhigh commented on pull request #9241: 
[BEAM-7889] Update RC validation guide due to run_rc_validation.sh change
URL: https://github.com/apache/beam/pull/9241
 
 
   A new config file was introduced in https://github.com/apache/beam/pull/8929, 
so this updates the release guide accordingly to make sure 
[script.config](https://github.com/apache/beam/blob/master/release/src/main/scripts/script.config)
 can be configured properly before people run `run_rc_validation.sh`.
   
   
   

[jira] [Created] (BEAM-7889) Make release scripts less interactive

2019-08-04 Thread Mark Liu (JIRA)
Mark Liu created BEAM-7889:
--

 Summary: Make release scripts less interactive
 Key: BEAM-7889
 URL: https://issues.apache.org/jira/browse/BEAM-7889
 Project: Beam
  Issue Type: Sub-task
  Components: build-system
Reporter: Mark Liu
Assignee: Mark Liu






--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (BEAM-2680) Improve scalability of the Watch transform

2019-08-04 Thread Steve Niemitz (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16899655#comment-16899655
 ] 

Steve Niemitz commented on BEAM-2680:
-

I ran into this today.  We have a long-running streaming pipeline that watches 
a GCS bucket and processes files as they arrive.  Eventually it was no longer 
able to make progress because the GrowthState's completed set grew too big to 
fit into memory.

> Improve scalability of the Watch transform
> --
>
> Key: BEAM-2680
> URL: https://issues.apache.org/jira/browse/BEAM-2680
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Reporter: Eugene Kirpichov
>Assignee: Eugene Kirpichov
>Priority: Major
>
> [https://github.com/apache/beam/pull/3565] introduces the Watch transform 
> [http://s.apache.org/beam-watch-transform].
> The implementation leaves several scalability-related TODOs:
>  1) The state stores hashes and timestamps of outputs that have already been 
> output and should be omitted from future polls. We could garbage-collect this 
> state, e.g. dropping elements from "completed" and from addNewAsPending() if 
> their timestamp is more than X behind the watermark.
>  2) When a poll returns a huge number of elements, we don't necessarily have 
> to add all of them into state.pending - instead we could add only N oldest 
> elements and ignore others, relying on future poll rounds to provide them, in 
> order to avoid blowing up the state. Combined with garbage collection of 
> GrowthState.completed, this would make the transform scalable to very large 
> poll results.
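
As an illustration of TODO (1), garbage collection could drop completed entries 
once they fall far enough behind the watermark. A minimal sketch, assuming the 
completed state is a map from output hash to event timestamp; the types and 
names are illustrative, not Watch's actual GrowthState representation:

```
import java.time.Duration;
import java.time.Instant;
import java.util.Map;

// Illustrative sketch: forget "completed" dedup entries whose timestamp is
// more than gcHorizon behind the watermark; entries that old should no longer
// reappear in future polls, so they are safe to drop.
class GrowthStateGc {
  static void gcCompleted(Map<Long, Instant> completed, Instant watermark, Duration gcHorizon) {
    Instant cutoff = watermark.minus(gcHorizon);
    completed.values().removeIf(ts -> ts.isBefore(cutoff));
  }
}
```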



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)