[jira] [Work logged] (BEAM-7720) Fix the exception type of InMemoryJobService when job id not found

2019-08-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7720?focusedWorklogId=295231&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-295231
 ]

ASF GitHub Bot logged work on BEAM-7720:


Author: ASF GitHub Bot
Created on: 15/Aug/19 05:55
Start Date: 15/Aug/19 05:55
Worklog Time Spent: 10m 
  Work Description: chadrik commented on issue #9347: [BEAM-7720] Fix the 
exception type of InMemoryJobService when job id not found
URL: https://github.com/apache/beam/pull/9347#issuecomment-521524078
 
 
   Original discussion about this problem is here: 
https://github.com/apache/beam/pull/8977#discussion_r301243749
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 295231)
Time Spent: 0.5h  (was: 20m)

> Fix the exception type of InMemoryJobService when job id not found
> --
>
> Key: BEAM-7720
> URL: https://issues.apache.org/jira/browse/BEAM-7720
> Project: Beam
>  Issue Type: Bug
>  Components: beam-model
>Reporter: Chad Dombrova
>Assignee: Chad Dombrova
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The contract in beam_job_api.proto for `CancelJobRequest`, 
> `GetJobStateRequest`, and `GetJobPipelineRequest` states:
>   
> {noformat}
> // Throws error NOT_FOUND if the jobId is not found{noformat}
>   
> However, `InMemoryJobService` is handling this exception incorrectly by 
> rethrowing `NOT_FOUND` exceptions as `INTERNAL`.
> Neither `JobMessagesRequest` nor `GetJobMetricsRequest` states its contract 
> with respect to exceptions, but they should probably be updated to handle 
> `NOT_FOUND` in the same way.
>  
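To make the intended contract concrete, here is a minimal Python sketch (a stand-in only: the real `InMemoryJobService` is Java, and the class, method, and error names below are hypothetical) of propagating a missing-job lookup as `NOT_FOUND` instead of wrapping it as `INTERNAL`:

```python
from enum import Enum


class StatusCode(Enum):
    NOT_FOUND = "NOT_FOUND"
    INTERNAL = "INTERNAL"


class ServiceError(Exception):
    """Carries a status code alongside the message, like a gRPC StatusException."""

    def __init__(self, code, message):
        super().__init__(message)
        self.code = code


class ToyJobService:
    """Hypothetical stand-in for the Java InMemoryJobService."""

    def __init__(self, invocations):
        self._invocations = invocations  # job_id -> invocation record

    def get_state(self, job_id):
        try:
            invocation = self._invocations[job_id]  # the getInvocation() analogue
        except KeyError:
            # The beam_job_api.proto contract promises NOT_FOUND for an
            # unknown jobId; the bug was re-wrapping this case as INTERNAL.
            raise ServiceError(StatusCode.NOT_FOUND, "job %r not found" % job_id)
        return invocation["state"]
```

The only point of interest is the `except KeyError` branch: the lookup failure maps to the status code promised in `beam_job_api.proto` rather than to a generic internal error.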



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Assigned] (BEAM-7720) Fix the exception type of InMemoryJobService when job id not found

2019-08-14 Thread Chad Dombrova (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chad Dombrova reassigned BEAM-7720:
---

Assignee: Chad Dombrova

> Fix the exception type of InMemoryJobService when job id not found
> --
>
> Key: BEAM-7720
> URL: https://issues.apache.org/jira/browse/BEAM-7720
> Project: Beam
>  Issue Type: Bug
>  Components: beam-model
>Reporter: Chad Dombrova
>Assignee: Chad Dombrova
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The contract in beam_job_api.proto for `CancelJobRequest`, 
> `GetJobStateRequest`, and `GetJobPipelineRequest` states:
>   
> {noformat}
> // Throws error NOT_FOUND if the jobId is not found{noformat}
>   
> However, `InMemoryJobService` is handling this exception incorrectly by 
> rethrowing `NOT_FOUND` exceptions as `INTERNAL`.
> Neither `JobMessagesRequest` nor `GetJobMetricsRequest` states its contract 
> with respect to exceptions, but they should probably be updated to handle 
> `NOT_FOUND` in the same way.
>  





[jira] [Work logged] (BEAM-7720) Fix the exception type of InMemoryJobService when job id not found

2019-08-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7720?focusedWorklogId=295227&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-295227
 ]

ASF GitHub Bot logged work on BEAM-7720:


Author: ASF GitHub Bot
Created on: 15/Aug/19 05:53
Start Date: 15/Aug/19 05:53
Worklog Time Spent: 10m 
  Work Description: chadrik commented on issue #9347: [BEAM-7720] Fix the 
exception type of InMemoryJobService when job id not found
URL: https://github.com/apache/beam/pull/9347#issuecomment-521523548
 
 
   R: @lukecwik 
 



Issue Time Tracking
---

Worklog Id: (was: 295227)
Time Spent: 20m  (was: 10m)

> Fix the exception type of InMemoryJobService when job id not found
> --
>
> Key: BEAM-7720
> URL: https://issues.apache.org/jira/browse/BEAM-7720
> Project: Beam
>  Issue Type: Bug
>  Components: beam-model
>Reporter: Chad Dombrova
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The contract in beam_job_api.proto for `CancelJobRequest`, 
> `GetJobStateRequest`, and `GetJobPipelineRequest` states:
>   
> {noformat}
> // Throws error NOT_FOUND if the jobId is not found{noformat}
>   
> However, `InMemoryJobService` is handling this exception incorrectly by 
> rethrowing `NOT_FOUND` exceptions as `INTERNAL`.
> Neither `JobMessagesRequest` nor `GetJobMetricsRequest` states its contract 
> with respect to exceptions, but they should probably be updated to handle 
> `NOT_FOUND` in the same way.
>  





[jira] [Work logged] (BEAM-7720) Fix the exception type of InMemoryJobService when job id not found

2019-08-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7720?focusedWorklogId=295226&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-295226
 ]

ASF GitHub Bot logged work on BEAM-7720:


Author: ASF GitHub Bot
Created on: 15/Aug/19 05:52
Start Date: 15/Aug/19 05:52
Worklog Time Spent: 10m 
  Work Description: chadrik commented on pull request #9347: [BEAM-7720] 
Fix the exception type of InMemoryJobService when job id not found
URL: https://github.com/apache/beam/pull/9347
 
 
   Everywhere that we call getInvocation(), we should handle the possible 
StatusException.
   
   
   
   

[jira] [Issue Comment Deleted] (BEAM-7871) Streaming from PubSub to Firestore doesn't work on Dataflow

2019-08-14 Thread Bruce Arctor (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Arctor updated BEAM-7871:
---
Comment: was deleted

(was: [~zdenulo], is this due to the lack of an IO for the Firebase API 
(Firestore in native mode)?  I do see that there is a Datastore connector, so I 
was wondering whether that works with Firestore in Datastore mode.  I haven't 
tried it yet, but this is a task I was hoping to accomplish (PubSub -> 
Beam/Dataflow -> Firestore in native mode, meaning the Firebase API) – just 
starting to look into it, which led me here.  )

> Streaming from PubSub to Firestore doesn't work on Dataflow
> ---
>
> Key: BEAM-7871
> URL: https://issues.apache.org/jira/browse/BEAM-7871
> Project: Beam
>  Issue Type: Bug
>  Components: io-py-gcp, runner-dataflow
>Affects Versions: 2.13.0
>Reporter: Zdenko Hrcek
>Priority: Major
>
> I came to the same error as here 
> [https://stackoverflow.com/questions/57059944/python-package-errors-while-running-gcp-dataflow]
>  but I don't see it reported anywhere, so I am creating an issue just in case.
> The pipeline is quite simple: reading from PubSub and writing to Firestore.
> The Beam version used is 2.13.0, on Python 2.7.
> With DirectRunner it works OK, but on Dataflow it throws the following message:
>  
> {code:java}
> java.util.concurrent.ExecutionException: java.lang.RuntimeException: Error 
> received from SDK harness for instruction -81: Traceback (most recent call 
> last):
>  File 
> "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/worker/sdk_worker.py",
>  line 157, in _execute
>  response = task()
>  File 
> "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/worker/sdk_worker.py",
>  line 190, in 
>  self._execute(lambda: worker.do_instruction(work), work)
>  File 
> "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/worker/sdk_worker.py",
>  line 312, in do_instruction
>  request.instruction_id)
>  File 
> "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/worker/sdk_worker.py",
>  line 331, in process_bundle
>  bundle_processor.process_bundle(instruction_id))
>  File 
> "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/worker/bundle_processor.py",
>  line 538, in process_bundle
>  op.start()
>  File "apache_beam/runners/worker/operations.py", line 554, in 
> apache_beam.runners.worker.operations.DoOperation.start
>  def start(self):
>  File "apache_beam/runners/worker/operations.py", line 555, in 
> apache_beam.runners.worker.operations.DoOperation.start
>  with self.scoped_start_state:
>  File "apache_beam/runners/worker/operations.py", line 557, in 
> apache_beam.runners.worker.operations.DoOperation.start
>  self.dofn_runner.start()
>  File "apache_beam/runners/common.py", line 778, in 
> apache_beam.runners.common.DoFnRunner.start
>  self._invoke_bundle_method(self.do_fn_invoker.invoke_start_bundle)
>  File "apache_beam/runners/common.py", line 775, in 
> apache_beam.runners.common.DoFnRunner._invoke_bundle_method
>  self._reraise_augmented(exn)
>  File "apache_beam/runners/common.py", line 800, in 
> apache_beam.runners.common.DoFnRunner._reraise_augmented
>  raise_with_traceback(new_exn)
>  File "apache_beam/runners/common.py", line 773, in 
> apache_beam.runners.common.DoFnRunner._invoke_bundle_method
>  bundle_method()
>  File "apache_beam/runners/common.py", line 359, in 
> apache_beam.runners.common.DoFnInvoker.invoke_start_bundle
>  def invoke_start_bundle(self):
>  File "apache_beam/runners/common.py", line 363, in 
> apache_beam.runners.common.DoFnInvoker.invoke_start_bundle
>  self.signature.start_bundle_method.method_value())
>  File 
> "/home/zdenulo/dev/gcp_stuff/df_firestore_stream/df_firestore_stream.py", 
> line 39, in start_bundle
> NameError: global name 'firestore' is not defined [while running 
> 'generatedPtransform-64']
>  
> {code}
> It's interesting that using Beam version 2.12.0 solves the problem on 
> Dataflow; it works as expected. I am not sure what the cause could be.
> Here is a repository with the code which was used 
> [https://github.com/zdenulo/dataflow_firestore_stream]
>  
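This class of `NameError` on Dataflow workers is commonly worked around by deferring the import into `start_bundle` (or by setting the `save_main_session` pipeline option); the ticket does not confirm the root cause, so treat this as an assumed fix. A runnable stand-in sketch, with the stdlib `json` module standing in for `google.cloud.firestore`:

```python
import importlib


class WriteDocFn:
    """Sketch of a beam.DoFn whose client import is deferred to start_bundle,
    so the import resolves inside the worker process. The stdlib "json"
    module stands in for "google.cloud.firestore" here."""

    def __init__(self, module_name="json"):
        self._module_name = module_name
        self._client_module = None

    def start_bundle(self):
        # Importing here (rather than at module scope) avoids the
        # "NameError: global name 'firestore' is not defined" failure mode
        # when module-level state does not survive serialization to workers.
        self._client_module = importlib.import_module(self._module_name)

    def process(self, element):
        yield self._client_module.dumps(element)


fn = WriteDocFn()
fn.start_bundle()
doc = next(fn.process({"a": 1}))  # → '{"a": 1}'
```

In a real pipeline the constructor argument would name the Google Cloud client module, and `process` would write the element to Firestore instead of serializing it.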





[jira] [Work logged] (BEAM-7742) BigQuery File Loads to work well with load job size limits

2019-08-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7742?focusedWorklogId=295189&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-295189
 ]

ASF GitHub Bot logged work on BEAM-7742:


Author: ASF GitHub Bot
Created on: 15/Aug/19 04:45
Start Date: 15/Aug/19 04:45
Worklog Time Spent: 10m 
  Work Description: ttanay commented on issue #9242: [BEAM-7742] Partition 
files in BQFL to cater to quotas & limits
URL: https://github.com/apache/beam/pull/9242#issuecomment-521511917
 
 
   > Ah we discussed this, right? We need temp tables for dynamic destinations 
:)
   
   Yes. Making the change :hammer: 
 



Issue Time Tracking
---

Worklog Id: (was: 295189)
Time Spent: 1h 40m  (was: 1.5h)

> BigQuery File Loads to work well with load job size limits
> --
>
> Key: BEAM-7742
> URL: https://issues.apache.org/jira/browse/BEAM-7742
> Project: Beam
>  Issue Type: Improvement
>  Components: io-py-gcp
>Reporter: Pablo Estrada
>Assignee: Tanay Tummalapalli
>Priority: Major
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Load jobs into BigQuery have a number of limitations: 
> [https://cloud.google.com/bigquery/quotas#load_jobs]
>  
> Currently, the Python BQ sink implemented in `bigquery_file_loads.py` does 
> not handle these limitations well. Improvements need to be made to the 
> implementation, to:
>  * Decide to use temp_tables dynamically at pipeline execution
>  * Add code to determine when a load job to a single destination needs to be 
> partitioned into multiple jobs.
>  * When this happens, then we definitely need to use temp_tables, in case one 
> of the two load jobs fails, and the pipeline is rerun.
> Tanay, would you be able to look at this?
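The partitioning step in the second bullet can be sketched as a greedy grouping pass. This is an illustrative sketch only, not the PR's actual code; the default limit values are assumptions loosely based on the quotas page:

```python
def partition_files(files, max_files=10000, max_bytes=15 * 2**40):
    """Greedily group (name, size) pairs into load-job partitions that
    respect BigQuery's per-job file-count and total-byte limits.

    Sketch only: the real logic belongs in bigquery_file_loads.py, and the
    default limits here are assumed, not authoritative."""
    partitions = []
    current, current_bytes = [], 0
    for name, size in files:
        if current and (len(current) >= max_files or current_bytes + size > max_bytes):
            # This load job is full; flush it and start a new one.
            partitions.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += size
    if current:
        partitions.append(current)
    return partitions
```

Whenever this returns more than one partition for a single destination, the pipeline needs temp tables, so that a partial failure followed by a rerun cannot double-load the destination table.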





[jira] [Commented] (BEAM-7981) ParDo function wrapper doesn't support Iterable output types

2019-08-14 Thread Ahmet Altay (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-7981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907813#comment-16907813
 ] 

Ahmet Altay commented on BEAM-7981:
---

Ack. Thank you.

> ParDo function wrapper doesn't support Iterable output types
> 
>
> Key: BEAM-7981
> URL: https://issues.apache.org/jira/browse/BEAM-7981
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Udi Meiri
>Priority: Major
>
> I believe the bug is in CallableWrapperDoFn.default_type_hints, which 
> converts Iterable[str] to str.
> This test will be included (commented out) in 
> https://github.com/apache/beam/pull/9283
> {code}
>   def test_typed_callable_iterable_output(self):
> @typehints.with_input_types(int)
> @typehints.with_output_types(typehints.Iterable[str])
> def do_fn(element):
>   return [[str(element)] * 2]
> result = [1, 2] | beam.ParDo(do_fn)
> self.assertEqual([['1', '1'], ['2', '2']], sorted(result))
> {code}
> Result:
> {code}
> ==
> ERROR: test_typed_callable_iterable_output 
> (apache_beam.typehints.typed_pipeline_test.MainInputTest)
> --
> Traceback (most recent call last):
>   File 
> "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/typehints/typed_pipeline_test.py",
>  line 104, in test_typed_callable_iterable_output
> result = [1, 2] | beam.ParDo(do_fn)
>   File 
> "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/transforms/ptransform.py",
>  line 519, in __ror__
> p.run().wait_until_finish()
>   File 
> "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/pipeline.py", 
> line 406, in run
> self._options).run(False)
>   File 
> "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/pipeline.py", 
> line 419, in run
> return self.runner.run_pipeline(self, self._options)
>   File 
> "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/direct/direct_runner.py",
>  line 129, in run_pipeline
> return runner.run_pipeline(pipeline, options)
>   File 
> "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/portability/fn_api_runner.py",
>  line 366, in run_pipeline
> default_environment=self._default_environment))
>   File 
> "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/portability/fn_api_runner.py",
>  line 373, in run_via_runner_api
> return self.run_stages(stage_context, stages)
>   File 
> "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/portability/fn_api_runner.py",
>  line 455, in run_stages
> stage_context.safe_coders)
>   File 
> "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/portability/fn_api_runner.py",
>  line 733, in _run_stage
> result, splits = bundle_manager.process_bundle(data_input, data_output)
>   File 
> "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/portability/fn_api_runner.py",
>  line 1663, in process_bundle
> part, expected_outputs), part_inputs):
>   File "/usr/lib/python3.7/concurrent/futures/_base.py", line 586, in 
> result_iterator
> yield fs.pop().result()
>   File "/usr/lib/python3.7/concurrent/futures/_base.py", line 432, in result
> return self.__get_result()
>   File "/usr/lib/python3.7/concurrent/futures/_base.py", line 384, in 
> __get_result
> raise self._exception
>   File "/usr/lib/python3.7/concurrent/futures/thread.py", line 57, in run
> result = self.fn(*self.args, **self.kwargs)
>   File 
> "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/portability/fn_api_runner.py",
>  line 1663, in 
> part, expected_outputs), part_inputs):
>   File 
> "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/portability/fn_api_runner.py",
>  line 1601, in process_bundle
> result_future = self._worker_handler.control_conn.push(process_bundle_req)
>   File 
> "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/portability/fn_api_runner.py",
>  line 1080, in push
> response = self.worker.do_instruction(request)
>   File 
> "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/worker/sdk_worker.py",
>  line 343, in do_instruction
> request.instruction_id)
>   File 
> "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/worker/sdk_worker.py",
>  line 369, in process_bundle
> bundle_processor.process_bundle(instruction_id))
>   File 
> "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/worker/bundle_processor.py",
>  line 593, in process_bundle
> data.ptransform_id].process_encoded(data.data)
>   File 
> 
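The collapse described in the report can be illustrated with plain `typing` machinery. This is an assumed mechanism for illustration only, not Beam's actual `CallableWrapperDoFn` code:

```python
import collections.abc
from typing import Iterable, get_args, get_origin


def strip_iterable(hint):
    """Loose mimic of unwrapping a DoFn's Iterable output hint to its
    element type (illustration only, not Beam's implementation)."""
    if get_origin(hint) in (collections.abc.Iterable, list, tuple):
        return get_args(hint)[0]
    return hint


# Iterable[str] unwraps to its element type, str. But a str is itself an
# iterable of strings, so "element type str" and an intended Iterable[str]
# element type become indistinguishable downstream.
element_type = strip_iterable(Iterable[str])
```

This is why an output hint of `Iterable[str]` is particularly ambiguous: unwrapping once yields `str`, which still satisfies `Iterable[str]`, so the wrapper cannot tell whether the user meant strings or iterables of strings.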

[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-08-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=295172&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-295172
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 15/Aug/19 04:14
Start Date: 15/Aug/19 04:14
Worklog Time Spent: 10m 
  Work Description: robinyqiu commented on issue #9144: [BEAM-7013] 
Integrating ZetaSketch's HLL++ algorithm with Beam
URL: https://github.com/apache/beam/pull/9144#issuecomment-521507156
 
 
   Hi all! I have implemented integration tests for `HllCount` (running on 
`DataflowRunner`) to test sketch compatibility between Beam and BigQuery. PTAL!
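For context, a toy HyperLogLog estimator shows the memory/accuracy trade-off being tested here. This is classic HLL with linear counting, much simplified relative to HLL++/ZetaSketch, and its sketches are not compatible with BigQuery's:

```python
import hashlib
import math


def hll_estimate(items, p=10):
    """Toy HyperLogLog: m = 2**p registers, each keeping the maximum
    'rank' (position of the first 1-bit) among hashes routed to it.
    Classic HLL plus linear counting; NOT the HLL++/ZetaSketch format."""
    m = 2 ** p
    registers = [0] * m
    for item in items:
        # 64-bit deterministic hash of the item.
        h = int.from_bytes(hashlib.md5(str(item).encode()).digest()[:8], "big")
        idx = h >> (64 - p)                # first p bits choose a register
        rest = h & ((1 << (64 - p)) - 1)   # remaining 64-p bits supply the rank
        rank = (64 - p) - rest.bit_length() + 1
        registers[idx] = max(registers[idx], rank)
    alpha = 0.7213 / (1 + 1.079 / m)
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:
        # Small-cardinality correction: linear counting over empty registers.
        return m * math.log(m / zeros)
    return raw
```

With `p=10` the sketch keeps only 1024 small registers yet estimates 1,000 distinct items to within a few percent; HLL++ layers bias correction and a sparse representation on top of this idea, which is what makes its sketches compact and mergeable across Beam and BigQuery.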
 



Issue Time Tracking
---

Worklog Id: (was: 295172)
Time Spent: 17.5h  (was: 17h 20m)

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
>  Issue Type: New Feature
>  Components: extensions-java-sketching, sdk-java-core
>Reporter: Yueyang Qiu
>Assignee: Yueyang Qiu
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 17.5h
>  Remaining Estimate: 0h
>






[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-08-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=295171&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-295171
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 15/Aug/19 04:12
Start Date: 15/Aug/19 04:12
Worklog Time Spent: 10m 
  Work Description: robinyqiu commented on pull request #9144: [BEAM-7013] 
Integrating ZetaSketch's HLL++ algorithm with Beam
URL: https://github.com/apache/beam/pull/9144#discussion_r314169000
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/main/java/org/apache/beam/sdk/extensions/zetasketch/HllCount.java
 ##
 @@ -0,0 +1,345 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.beam.sdk.extensions.zetasketch;
+
+import com.google.zetasketch.HyperLogLogPlusPlus;
+import org.apache.beam.sdk.annotations.Experimental;
+import org.apache.beam.sdk.transforms.Combine;
+import org.apache.beam.sdk.transforms.DoFn;
+import org.apache.beam.sdk.transforms.PTransform;
+import org.apache.beam.sdk.transforms.ParDo;
+import org.apache.beam.sdk.values.KV;
+import org.apache.beam.sdk.values.PCollection;
+
+/**
+ * {@code PTransform}s to compute HyperLogLogPlusPlus (HLL++) sketches on data streams based on
+ * the <a href="https://github.com/google/zetasketch">ZetaSketch</a> implementation.
+ *
+ * <p>HLL++ is an algorithm implemented by Google that estimates the count of distinct elements
+ * in a data stream. HLL++ requires significantly less memory than the linear memory needed for
+ * exact computation, at the cost of a small error. Cardinalities of arbitrary breakdowns can be
+ * computed using the HLL++ sketch. See this <a
+ * href="http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/40671.pdf">published
+ * paper</a> for details about the algorithm.
+ *
+ * <p>HLL++ functions are also supported in <a
+ * href="https://cloud.google.com/bigquery/docs/reference/standard-sql/hll_functions">Google Cloud
+ * BigQuery</a>. Using the {@code HllCount PTransform}s makes the interoperation with BigQuery
+ * easier.
+ *
+ * <h3>Examples</h3>
+ *
+ * <p>Example 1: Create long-type sketch for a {@code PCollection<Long>} and specify precision
+ *
+ * <pre>{@code
+ * PCollection<Long> input = ...;
+ * int p = ...;
+ * PCollection<byte[]> sketch = input.apply(HllCount.Init.longSketch().withPrecision(p).globally());
+ * }</pre>
+ *
+ * <p>Example 2: Create bytes-type sketch for a {@code PCollection<KV<String, byte[]>>}
+ *
+ * <pre>{@code
+ * PCollection<KV<String, byte[]>> input = ...;
+ * PCollection<KV<String, byte[]>> sketch = input.apply(HllCount.Init.bytesSketch().perKey());
+ * }</pre>
+ *
+ * <p>Example 3: Merge existing sketches in a {@code PCollection<byte[]>} into a new one
+ *
+ * <pre>{@code
+ * PCollection<byte[]> sketches = ...;
+ * PCollection<byte[]> mergedSketch = sketches.apply(HllCount.MergePartial.globally());
+ * }</pre>
+ *
+ * <p>Example 4: Estimates the count of distinct elements in a {@code PCollection<String>}
+ *
+ * <pre>{@code
+ * PCollection<String> input = ...;
+ * PCollection<Long> countDistinct =
+ *     input.apply(HllCount.Init.stringSketch().globally()).apply(HllCount.Extract.globally());
+ * }</pre>
+ */
+@Experimental
+public final class HllCount {
+
+  public static final int MINIMUM_PRECISION = 
HyperLogLogPlusPlus.MINIMUM_PRECISION;
+  public static final int MAXIMUM_PRECISION = 
HyperLogLogPlusPlus.MAXIMUM_PRECISION;
+  public static final int DEFAULT_PRECISION = 
HyperLogLogPlusPlus.DEFAULT_NORMAL_PRECISION;
+
+  // Cannot be instantiated. This class is intended to be a namespace only.
+  private HllCount() {}
+
+  /**
+   * Provide {@code PTransform}s to aggregate inputs into HLL++ sketches. The four supported input
+   * types are {@code Integer}, {@code Long}, {@code String}, and {@code byte[]}.
+   *
+   * <p>Sketches are represented using the {@code byte[]} type. Sketches of the same type and {@code
+   * precision} can be merged into a new sketch using {@link HllCount.MergePartial}. Estimated count
+   * of distinct elements can be extracted from sketches using {@link HllCount.Extract}.
+   *
+   * <p>Correspond to the {@code HLL_COUNT.INIT(input [, precision])} function in <a
+   * href="https://cloud.google.com/bigquery/docs/reference/standard-sql/hll_functions">BigQuery</a>.
+   */
+  public static 

[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-08-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=295168&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-295168
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 15/Aug/19 04:02
Start Date: 15/Aug/19 04:02
Worklog Time Spent: 10m 
  Work Description: robinyqiu commented on pull request #9144: [BEAM-7013] 
Integrating ZetaSketch's HLL++ algorithm with Beam
URL: https://github.com/apache/beam/pull/9144#discussion_r314167778
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/HllCountTest.java
 ##
 @@ -0,0 +1,373 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.beam.sdk.extensions.zetasketch;
+
+import com.google.zetasketch.HyperLogLogPlusPlus;
+import com.google.zetasketch.shaded.com.google.protobuf.ByteString;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+import org.apache.beam.sdk.Pipeline.PipelineExecutionException;
+import org.apache.beam.sdk.testing.NeedsRunner;
+import org.apache.beam.sdk.testing.PAssert;
+import org.apache.beam.sdk.testing.TestPipeline;
+import org.apache.beam.sdk.transforms.Create;
+import org.apache.beam.sdk.values.KV;
+import org.apache.beam.sdk.values.PCollection;
+import org.apache.beam.sdk.values.TypeDescriptor;
+import org.junit.Rule;
+import org.junit.Test;
+import org.junit.experimental.categories.Category;
+import org.junit.rules.ExpectedException;
+import org.junit.runner.RunWith;
+import org.junit.runners.JUnit4;
+
+/** Tests for {@link HllCount}. */
+@RunWith(JUnit4.class)
+public class HllCountTest {
 
 Review comment:
   Hi. The integration tests (running on Dataflow and BigQuery) are added to 
this PR! PTAL.
   
   Here is the link showing the tests run successfully on Jenkins as a Java 
PostCommit test: 
https://builds.apache.org/job/beam_PostCommit_Java_PR/221/testReport/org.apache.beam.sdk.extensions.zetasketch/
 



Issue Time Tracking
---

Worklog Id: (was: 295168)
Time Spent: 17h  (was: 16h 50m)

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
>  Issue Type: New Feature
>  Components: extensions-java-sketching, sdk-java-core
>Reporter: Yueyang Qiu
>Assignee: Yueyang Qiu
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 17h
>  Remaining Estimate: 0h
>






[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-08-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=295167&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-295167
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 15/Aug/19 04:01
Start Date: 15/Aug/19 04:01
Worklog Time Spent: 10m 
  Work Description: robinyqiu commented on pull request #9144: [BEAM-7013] 
Integrating ZetaSketch's HLL++ algorithm with Beam
URL: https://github.com/apache/beam/pull/9144#discussion_r314167778
 
 

 ##
 File path: 
sdks/java/extensions/zetasketch/src/test/java/org/apache/beam/sdk/extensions/zetasketch/HllCountTest.java
 ##
 @@ -0,0 +1,373 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.beam.sdk.extensions.zetasketch;
+
+import com.google.zetasketch.HyperLogLogPlusPlus;
+import com.google.zetasketch.shaded.com.google.protobuf.ByteString;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+import org.apache.beam.sdk.Pipeline.PipelineExecutionException;
+import org.apache.beam.sdk.testing.NeedsRunner;
+import org.apache.beam.sdk.testing.PAssert;
+import org.apache.beam.sdk.testing.TestPipeline;
+import org.apache.beam.sdk.transforms.Create;
+import org.apache.beam.sdk.values.KV;
+import org.apache.beam.sdk.values.PCollection;
+import org.apache.beam.sdk.values.TypeDescriptor;
+import org.junit.Rule;
+import org.junit.Test;
+import org.junit.experimental.categories.Category;
+import org.junit.rules.ExpectedException;
+import org.junit.runner.RunWith;
+import org.junit.runners.JUnit4;
+
+/** Tests for {@link HllCount}. */
+@RunWith(JUnit4.class)
+public class HllCountTest {
 
 Review comment:
   Hi. The integration tests (running on Dataflow and BigQuery) have been added to this PR! PTAL.
   
   Here is the link showing the tests run successfully on Jenkins as a Java 
PostCommit test: 
https://builds.apache.org/job/beam_PostCommit_Java_PR/221/testReport/org.apache.beam.sdk.extensions.zetasketch/
 



Issue Time Tracking
---

Worklog Id: (was: 295167)
Time Spent: 16h 50m  (was: 16h 40m)

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
>  Issue Type: New Feature
>  Components: extensions-java-sketching, sdk-java-core
>Reporter: Yueyang Qiu
>Assignee: Yueyang Qiu
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 16h 50m
>  Remaining Estimate: 0h
>
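For context on the transform under discussion, the core idea behind HyperLogLog-style count-distinct — tracking, per hash bucket, the maximum number of leading zero bits observed — can be sketched in plain Python. This is a minimal textbook HLL for illustration only, not ZetaSketch's HLL++ and not Beam's `HllCount` implementation; the register count `2**p` and the bias constant are standard textbook choices:

```python
import hashlib
import math

def hll_estimate(items, p=10):
    """Minimal HyperLogLog: m = 2**p registers, each storing the maximum
    'rank' (1-based position of the first 1-bit) among hashes routed to it."""
    m = 1 << p
    registers = [0] * m
    for item in items:
        # Deterministic 64-bit hash of the item's string form.
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = h >> (64 - p)                 # first p bits select a register
        rest = h & ((1 << (64 - p)) - 1)    # remaining 64-p bits
        rank = (64 - p) - rest.bit_length() + 1
        registers[idx] = max(registers[idx], rank)
    alpha = 0.7213 / (1 + 1.079 / m)        # textbook bias-correction constant
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:            # small-range (linear counting) fix
        return m * math.log(m / zeros)
    return raw
```

Because the sketch only keeps per-register maxima, duplicates never change the estimate — which is what makes the approach mergeable across bundles and, in ZetaSketch's case, compatible with BigQuery's sketches.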






[jira] [Work logged] (BEAM-7742) BigQuery File Loads to work well with load job size limits

2019-08-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7742?focusedWorklogId=295161&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-295161
 ]

ASF GitHub Bot logged work on BEAM-7742:


Author: ASF GitHub Bot
Created on: 15/Aug/19 03:10
Start Date: 15/Aug/19 03:10
Worklog Time Spent: 10m 
  Work Description: pabloem commented on issue #9242: [BEAM-7742] Partition 
files in BQFL to cater to quotas & limits
URL: https://github.com/apache/beam/pull/9242#issuecomment-521497591
 
 
   Ah we discussed this, right? We need temp tables for dynamic destinations : )
 



Issue Time Tracking
---

Worklog Id: (was: 295161)
Time Spent: 1.5h  (was: 1h 20m)

> BigQuery File Loads to work well with load job size limits
> --
>
> Key: BEAM-7742
> URL: https://issues.apache.org/jira/browse/BEAM-7742
> Project: Beam
>  Issue Type: Improvement
>  Components: io-py-gcp
>Reporter: Pablo Estrada
>Assignee: Tanay Tummalapalli
>Priority: Major
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Load jobs into BigQuery have a number of limitations: 
> [https://cloud.google.com/bigquery/quotas#load_jobs]
>  
> Currently, the python BQ sink implemented in `bigquery_file_loads.py` does 
> not handle these limitations well. Improvements need to be made to the 
> implementation, to:
>  * Decide to use temp_tables dynamically at pipeline execution
>  * Add code to determine when a load job to a single destination needs to be 
> partitioned into multiple jobs.
>  * When this happens, then we definitely need to use temp_tables, in case one 
> of the two load jobs fails, and the pipeline is rerun.
> Tanay, would you be able to look at this?
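The partitioning step described in the second bullet can be sketched as a greedy grouping of (file, size) pairs under per-job limits. This is an illustrative sketch, not the actual `bigquery_file_loads.py` code; the default limit values here are placeholders, and the real quotas should come from the BigQuery page linked above:

```python
def partition_files(files, max_files_per_job=10000, max_bytes_per_job=15 * 10**12):
    """Greedily split (name, size_bytes) pairs into groups that each
    respect the per-load-job file-count and total-byte limits."""
    jobs, current, current_bytes = [], [], 0
    for name, size in files:
        # Start a new job when adding this file would break either limit.
        if current and (len(current) >= max_files_per_job
                        or current_bytes + size > max_bytes_per_job):
            jobs.append(current)
            current, current_bytes = [], 0
        current.append((name, size))
        current_bytes += size
    if current:
        jobs.append(current)
    return jobs
```

As the description notes, once a destination can receive more than one load job, writes should go through temp tables first, so that a rerun after a partial failure does not double-load rows.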





[jira] [Work logged] (BEAM-7984) [python] The coder returned for typehints.List should be IterableCoder

2019-08-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7984?focusedWorklogId=295149&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-295149
 ]

ASF GitHub Bot logged work on BEAM-7984:


Author: ASF GitHub Bot
Created on: 15/Aug/19 02:50
Start Date: 15/Aug/19 02:50
Worklog Time Spent: 10m 
  Work Description: chadrik commented on issue #9344: [BEAM-7984] The coder 
returned for typehints.List should be IterableCoder
URL: https://github.com/apache/beam/pull/9344#issuecomment-521494404
 
 
   R: @robertwb 
 



Issue Time Tracking
---

Worklog Id: (was: 295149)
Time Spent: 20m  (was: 10m)

> [python] The coder returned for typehints.List should be IterableCoder
> --
>
> Key: BEAM-7984
> URL: https://issues.apache.org/jira/browse/BEAM-7984
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-py-core
>Reporter: Chad Dombrova
>Assignee: Chad Dombrova
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> IterableCoder encodes a list and decodes to list, but 
> typecoders.registry.get_coder(typehints.List[bytes]) returns a 
> FastPrimitiveCoder.  I don't see any reason why this would be advantageous. 
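The desired dispatch can be illustrated with a toy coder lookup in plain Python. This is a conceptual sketch only — the class names mirror the ones in the issue, but the machinery is invented for illustration and is not Beam's `typecoders` registry:

```python
import collections.abc
import typing

class IterableCoder:            # stand-in for Beam's IterableCoder
    def __init__(self, element_coder):
        self.element_coder = element_coder

class BytesCoder:               # stand-in element coder
    pass

class FastPrimitiveCoder:       # stand-in fallback coder
    pass

def get_coder(hint):
    """Return an IterableCoder for List/Iterable hints instead of the fallback."""
    origin = typing.get_origin(hint)
    if origin in (list, collections.abc.Iterable):
        (element,) = typing.get_args(hint)
        return IterableCoder(get_coder(element))
    if hint is bytes:
        return BytesCoder()
    return FastPrimitiveCoder()
```

With this dispatch, `get_coder(typing.List[bytes])` yields an `IterableCoder` wrapping a `BytesCoder` — the behavior the issue argues the registry should have — rather than falling through to the primitive fallback.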





[jira] [Work logged] (BEAM-7984) [python] The coder returned for typehints.List should be IterableCoder

2019-08-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7984?focusedWorklogId=295148&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-295148
 ]

ASF GitHub Bot logged work on BEAM-7984:


Author: ASF GitHub Bot
Created on: 15/Aug/19 02:49
Start Date: 15/Aug/19 02:49
Worklog Time Spent: 10m 
  Work Description: chadrik commented on pull request #9344: [BEAM-7984] 
The coder returned for typehints.List should be IterableCoder
URL: https://github.com/apache/beam/pull/9344
 
 
   **Please** add a meaningful description for your change here
   
   
   
   Thank you for your contribution! Follow this checklist to help us 
incorporate your contribution quickly and easily:
   
- [x] [**Choose 
reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and 
mention them in a comment (`R: @username`).
- [x] Format the pull request title like `[BEAM-XXX] Fixes bug in 
ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA 
issue, if applicable. This will automatically link the pull request to the 
issue.
- [x] If this contribution is large, please file an Apache [Individual 
Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
   
   Post-Commit Tests Status (on master branch)
   

   
   Lang | SDK | Apex | Dataflow | Flink | Gearpump | Samza | Spark
   --- | --- | --- | --- | --- | --- | --- | ---
   Go | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Go/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Go/lastCompletedBuild/)
 | --- | --- | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Go_VR_Flink/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Go_VR_Flink/lastCompletedBuild/)
 | --- | --- | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Go_VR_Spark/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Go_VR_Spark/lastCompletedBuild/)
   Java | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java/lastCompletedBuild/)
 | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Apex/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Apex/lastCompletedBuild/)
 | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow/lastCompletedBuild/)
 | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink/lastCompletedBuild/)[![Build
 
Status](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Flink_Batch/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Flink_Batch/lastCompletedBuild/)[![Build
 
Status](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Flink_Streaming/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Flink_Streaming/lastCompletedBuild/)
 | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Gearpump/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Gearpump/lastCompletedBuild/)
 | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Samza/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Samza/lastCompletedBuild/)
 | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Spark/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Spark/lastCompletedBuild/)[![Build
 
Status](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Spark_Batch/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Spark_Batch/lastCompletedBuild/)
   Python | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Python2/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Python2/lastCompletedBuild/)[![Build
 
Status](https://builds.apache.org/job/beam_PostCommit_Python35/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Python35/lastCompletedBuild/)[![Build
 
Status](https://builds.apache.org/job/beam_PostCommit_Python36/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Python36/lastCompletedBuild/)[![Build
 

[jira] [Created] (BEAM-7984) [python] The coder returned for typehints.List should be IterableCoder

2019-08-14 Thread Chad Dombrova (JIRA)
Chad Dombrova created BEAM-7984:
---

 Summary: [python] The coder returned for typehints.List should be 
IterableCoder
 Key: BEAM-7984
 URL: https://issues.apache.org/jira/browse/BEAM-7984
 Project: Beam
  Issue Type: Improvement
  Components: sdk-py-core
Reporter: Chad Dombrova
Assignee: Chad Dombrova


IterableCoder encodes a list and decodes to list, but 
typecoders.registry.get_coder(typehints.List[bytes]) returns a 
FastPrimitiveCoder.  I don't see any reason why this would be advantageous. 






[jira] [Work logged] (BEAM-7742) BigQuery File Loads to work well with load job size limits

2019-08-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7742?focusedWorklogId=295134&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-295134
 ]

ASF GitHub Bot logged work on BEAM-7742:


Author: ASF GitHub Bot
Created on: 15/Aug/19 02:10
Start Date: 15/Aug/19 02:10
Worklog Time Spent: 10m 
  Work Description: pabloem commented on issue #9242: [BEAM-7742] Partition 
files in BQFL to cater to quotas & limits
URL: https://github.com/apache/beam/pull/9242#issuecomment-521487610
 
 
   Run Python 3.5 PostCommit
 



Issue Time Tracking
---

Worklog Id: (was: 295134)
Time Spent: 1h 20m  (was: 1h 10m)

> BigQuery File Loads to work well with load job size limits
> --
>
> Key: BEAM-7742
> URL: https://issues.apache.org/jira/browse/BEAM-7742
> Project: Beam
>  Issue Type: Improvement
>  Components: io-py-gcp
>Reporter: Pablo Estrada
>Assignee: Tanay Tummalapalli
>Priority: Major
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Load jobs into BigQuery have a number of limitations: 
> [https://cloud.google.com/bigquery/quotas#load_jobs]
>  
> Currently, the python BQ sink implemented in `bigquery_file_loads.py` does 
> not handle these limitations well. Improvements need to be made to the 
> implementation, to:
>  * Decide to use temp_tables dynamically at pipeline execution
>  * Add code to determine when a load job to a single destination needs to be 
> partitioned into multiple jobs.
>  * When this happens, then we definitely need to use temp_tables, in case one 
> of the two load jobs fails, and the pipeline is rerun.
> Tanay, would you be able to look at this?





[jira] [Commented] (BEAM-7981) ParDo function wrapper doesn't support Iterable output types

2019-08-14 Thread Udi Meiri (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-7981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907753#comment-16907753
 ] 

Udi Meiri commented on BEAM-7981:
-

I have not committed to working on it yet.




> ParDo function wrapper doesn't support Iterable output types
> 
>
> Key: BEAM-7981
> URL: https://issues.apache.org/jira/browse/BEAM-7981
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Udi Meiri
>Priority: Major
>
> I believe the bug is in CallableWrapperDoFn.default_type_hints, which 
> converts Iterable[str] to str.
> This test will be included (commented out) in 
> https://github.com/apache/beam/pull/9283
> {code}
>   def test_typed_callable_iterable_output(self):
> @typehints.with_input_types(int)
> @typehints.with_output_types(typehints.Iterable[str])
> def do_fn(element):
>   return [[str(element)] * 2]
> result = [1, 2] | beam.ParDo(do_fn)
> self.assertEqual([['1', '1'], ['2', '2']], sorted(result))
> {code}
> Result:
> {code}
> ==
> ERROR: test_typed_callable_iterable_output 
> (apache_beam.typehints.typed_pipeline_test.MainInputTest)
> --
> Traceback (most recent call last):
>   File 
> "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/typehints/typed_pipeline_test.py",
>  line 104, in test_typed_callable_iterable_output
> result = [1, 2] | beam.ParDo(do_fn)
>   File 
> "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/transforms/ptransform.py",
>  line 519, in __ror__
> p.run().wait_until_finish()
>   File 
> "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/pipeline.py", 
> line 406, in run
> self._options).run(False)
>   File 
> "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/pipeline.py", 
> line 419, in run
> return self.runner.run_pipeline(self, self._options)
>   File 
> "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/direct/direct_runner.py",
>  line 129, in run_pipeline
> return runner.run_pipeline(pipeline, options)
>   File 
> "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/portability/fn_api_runner.py",
>  line 366, in run_pipeline
> default_environment=self._default_environment))
>   File 
> "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/portability/fn_api_runner.py",
>  line 373, in run_via_runner_api
> return self.run_stages(stage_context, stages)
>   File 
> "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/portability/fn_api_runner.py",
>  line 455, in run_stages
> stage_context.safe_coders)
>   File 
> "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/portability/fn_api_runner.py",
>  line 733, in _run_stage
> result, splits = bundle_manager.process_bundle(data_input, data_output)
>   File 
> "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/portability/fn_api_runner.py",
>  line 1663, in process_bundle
> part, expected_outputs), part_inputs):
>   File "/usr/lib/python3.7/concurrent/futures/_base.py", line 586, in 
> result_iterator
> yield fs.pop().result()
>   File "/usr/lib/python3.7/concurrent/futures/_base.py", line 432, in result
> return self.__get_result()
>   File "/usr/lib/python3.7/concurrent/futures/_base.py", line 384, in 
> __get_result
> raise self._exception
>   File "/usr/lib/python3.7/concurrent/futures/thread.py", line 57, in run
> result = self.fn(*self.args, **self.kwargs)
>   File 
> "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/portability/fn_api_runner.py",
>  line 1663, in 
> part, expected_outputs), part_inputs):
>   File 
> "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/portability/fn_api_runner.py",
>  line 1601, in process_bundle
> result_future = self._worker_handler.control_conn.push(process_bundle_req)
>   File 
> "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/portability/fn_api_runner.py",
>  line 1080, in push
> response = self.worker.do_instruction(request)
>   File 
> "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/worker/sdk_worker.py",
>  line 343, in do_instruction
> request.instruction_id)
>   File 
> "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/worker/sdk_worker.py",
>  line 369, in process_bundle
> bundle_processor.process_bundle(instruction_id))
>   File 
> "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/worker/bundle_processor.py",
>  line 593, in process_bundle
> data.ptransform_id].process_encoded(data.data)
>   File 
> 
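The suspected unwrapping in `CallableWrapperDoFn.default_type_hints` can be mimicked with plain `typing` introspection. This is a hypothetical stand-in — none of it is Beam's actual code — showing how replacing an `Iterable[X]` output hint with its element type turns `Iterable[str]` into `str`, which matches the reported symptom:

```python
import collections.abc
import typing

def unwrap_iterable(hint):
    """Hypothetical stand-in for the suspected bug: a DoFn wrapper that
    replaces an Iterable[X] output hint with plain X."""
    if typing.get_origin(hint) in (list, collections.abc.Iterable):
        (element,) = typing.get_args(hint)
        return element
    return hint

# The element type, not the iterable, is what the wrapper ends up declaring.
assert unwrap_iterable(typing.Iterable[str]) is str
```

That unwrapping is reasonable for a generator-style DoFn (where the declared iterable is flattened into elements), but applied to a callable that already returns elements it erases the intended output type.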

[jira] [Updated] (BEAM-7975) error syncing pod - failed to start container artifact (python SDK)

2019-08-14 Thread James Hutchison (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Hutchison updated BEAM-7975:
--
Description: 
{code:java}
Error syncing pod 5966e59c ("<job name>-08131110-7hcg-harness-fbm2_default(5966e59c)"), skipping: failed to 
"StartContainer" for "artifact" with CrashLoopBackOff: "Back-off 5m0s 
restarting failed container=artifact pod=<job name>-08131110-7hcg-harness-fbm2_default(5966.e59c)"{code}
Seeing these in streaming pipeline. Running pipeline in batch mode I'm not 
seeing anything. Messages appear about every 0.5 - 5 seconds

I've been trying to efficiently scale my streaming pipeline and found that 
adding more workers / dividing into more groups isn't scaling as well as I 
expect. Perhaps this is contributing (how do I tell if workers are being 
utilized or not?)

One pipeline which never completed (got to one of the last steps and then log 
messages simply ceased without error on the workers) had this going on in the 
kubelet logs. I checked some of my other streaming pipelines and found the same 
thing going on, even though they would complete.

In a couple of my streaming pipelines, I've gotten the following error message, 
despite the pipeline eventually finishing:
{code:java}
Processing stuck in step s01 for at least 05m00s without outputting or 
completing in state process{code}
Perhaps they are related?

This is running with 5 or 7 (or more) workers in streaming mode. I don't see 
this when running with 1 worker

The pipeline uses requirements.txt and setup.py, as well as using an extra 
package and using save_main_session.

  was:
{code:java}
Error syncing pod 5966e59c ("<job name>-08131110-7hcg-harness-fbm2_default(5966e59c)"), skipping: failed to 
"StartContainer" for "artifact" with CrashLoopBackOff: "Back-off 5m0s 
restarting failed container=artifact pod=<job name>-08131110-7hcg-harness-fbm2_default(5966.e59c)"{code}
Seeing these in streaming pipeline. Running pipeline in batch mode I'm not 
seeing anything. Messages appear about every 0.5 - 5 seconds

I've been trying to efficiently scale my streaming pipeline and found that 
adding more workers / dividing into more groups seems to have minimal 
improvement. Perhaps this is part of the problem?

One pipeline which never completed (got to one of the last steps and then log 
messages simply ceased without error on the workers) had this going on in the 
kubelet logs. I checked some of my other streaming pipelines and found the same 
thing going on, even though they would complete.

In a couple of my streaming pipelines, I've gotten the following error message, 
despite the pipeline eventually finishing:
{code:java}
Processing stuck in step s01 for at least 05m00s without outputting or 
completing in state process{code}
Perhaps they are related?

This is running with 5 or 7 (or more) workers in streaming mode. I don't see 
this when running with 1 worker

The pipeline uses requirements.txt and setup.py, as well as using an extra 
package and using save_main_session.


> error syncing pod - failed to start container artifact (python SDK)
> ---
>
> Key: BEAM-7975
> URL: https://issues.apache.org/jira/browse/BEAM-7975
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-harness
>Affects Versions: 2.13.0
>Reporter: James Hutchison
>Priority: Major
>
> {code:java}
> Error syncing pod 5966e59c ("<job name>-08131110-7hcg-harness-fbm2_default(5966e59c)"), skipping: failed to 
> "StartContainer" for "artifact" with CrashLoopBackOff: "Back-off 5m0s 
> restarting failed container=artifact pod=<job name>-08131110-7hcg-harness-fbm2_default(5966.e59c)"{code}
> Seeing these in streaming pipeline. Running pipeline in batch mode I'm not 
> seeing anything. Messages appear about every 0.5 - 5 seconds
> I've been trying to efficiently scale my streaming pipeline and found that 
> adding more workers / dividing into more groups isn't scaling as well as I 
> expect. Perhaps this is contributing (how do I tell if workers are being 
> utilized or not?)
> One pipeline which never completed (got to one of the last steps and then log 
> messages simply ceased without error on the workers) had this going on in the 
> kubelet logs. I checked some of my other streaming pipelines and found the 
> same thing going on, even though they would complete.
> In a couple of my streaming pipelines, I've gotten the following error 
> message, despite the pipeline eventually finishing:
> {code:java}
> Processing stuck in step s01 for at least 05m00s without outputting or 
> completing in state process{code}
> Perhaps they are related?
> This is running with 5 or 7 (or more) workers in streaming mode. I don't see 
> this when running with 1 worker
> The pipeline uses requirements.txt and setup.py, as well as using an extra 
> 

[jira] [Updated] (BEAM-7934) Dataflow Python SDK logging: step_id is always empty string

2019-08-14 Thread James Hutchison (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Hutchison updated BEAM-7934:
--
Description: 
Using the dataflow runner, log messages always show up in stackdriver with the 
step_id as the empty string, so filtering log messages for a step doesn't work.
{code:java}
resource: {
  labels: {
job_id: "" 
job_name: "" 
project_id: "" 
region: "" 
step_id: "" 
  }
  type: "dataflow_step" 
}{code}
Another user seems to have posted in the old github repo and appears to be 
seeing the same problem based on their output:

[https://github.com/GoogleCloudPlatform/DataflowPythonSDK/issues/62]

From what I can tell, this is only affecting streaming pipelines

  was:
Using the dataflow runner, log messages always show up in stackdriver with the 
step_id as the empty string, so filtering log messages for a step doesn't work.
{code:java}
resource: {
  labels: {
job_id: "" 
job_name: "" 
project_id: "" 
region: "" 
step_id: "" 
  }
  type: "dataflow_step" 
}{code}
Another user seems to have posted in the old github repo and appears to be 
seeing the same problem based on their output:



[https://github.com/GoogleCloudPlatform/DataflowPythonSDK/issues/62]


> Dataflow Python SDK logging: step_id is always empty string
> ---
>
> Key: BEAM-7934
> URL: https://issues.apache.org/jira/browse/BEAM-7934
> Project: Beam
>  Issue Type: Bug
>  Components: runner-dataflow, sdk-py-core
>Affects Versions: 2.13.0
>Reporter: James Hutchison
>Assignee: Ahmet Altay
>Priority: Major
>
> Using the dataflow runner, log messages always show up in stackdriver with 
> the step_id as the empty string, so filtering log messages for a step doesn't 
> work.
> {code:java}
> resource: {
>   labels: {
> job_id: "" 
> job_name: "" 
> project_id: "" 
> region: "" 
> step_id: "" 
>   }
>   type: "dataflow_step" 
> }{code}
> Another user seems to have posted in the old github repo and appears to be 
> seeing the same problem based on their output:
> [https://github.com/GoogleCloudPlatform/DataflowPythonSDK/issues/62]
> From what I can tell, this is only affecting streaming pipelines





[jira] [Commented] (BEAM-7930) bundle_processor log spam using python SDK on dataflow runner

2019-08-14 Thread James Hutchison (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-7930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907722#comment-16907722
 ] 

James Hutchison commented on BEAM-7930:
---

From what I can tell, this is coming from the grouping steps

> bundle_processor log spam using python SDK on dataflow runner
> -
>
> Key: BEAM-7930
> URL: https://issues.apache.org/jira/browse/BEAM-7930
> Project: Beam
>  Issue Type: Bug
>  Components: runner-dataflow, sdk-py-core
>Affects Versions: 2.13.0
>Reporter: James Hutchison
>Priority: Minor
>
> When running my pipeline on dataflow, I can see in the stackdriver logs a 
> large amount of spam for the following messages (note that the numbers in 
> them change every message):
>  * [INFO] (bundle_processor.create_operation) No unique name set for 
> transform generatedPtransform-67
>  * [INFO] (bundle_processor.create_operation) No unique name for transform -19
>  * [ERROR] (bundle_processor.create) Missing required coder_id on grpc_port 
> for -19; using deprecated fallback.
> I tried running locally using the debugger and setting breakpoints on where 
> these log messages originate using the direct runner and it never hit it, so 
> I don't know specifically what is causing them.
> I also tried using the logging module to change the threshold and also mocked 
> out the logging attribute in the bundle_processor module to change the log 
> level to CRITICAL and I still see the log messages.
> The pipeline is a streaming pipeline that reads from two pubsub topics, 
> merges the inputs and runs distinct on the inputs over each processing time 
> window, fetches from an external service, does processing, and inserts into 
> elasticsearch with failures going into bigquery. I notice the log messages 
> seem to cluster and this appears early on before any other log messages in 
> any of the other steps so I wonder if maybe this is coming from the pubsub 
> read or windowing portion.
> Expected behavior:
>  * I don't expect to see these noisy log messages which seem to indicate 
> something is wrong
>  * The missing required coder_id message is at the ERROR log level so it 
> pollutes the error logs. I would expect this to be at the WARNING or INFO 
> level.





[jira] [Work logged] (BEAM-3342) Create a Cloud Bigtable IO connector for Python

2019-08-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-3342?focusedWorklogId=295111&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-295111
 ]

ASF GitHub Bot logged work on BEAM-3342:


Author: ASF GitHub Bot
Created on: 15/Aug/19 00:56
Start Date: 15/Aug/19 00:56
Worklog Time Spent: 10m 
  Work Description: pabloem commented on pull request #8457: [BEAM-3342] 
Create a Cloud Bigtable IO connector for Python
URL: https://github.com/apache/beam/pull/8457#discussion_r314140821
 
 

 ##
 File path: sdks/python/apache_beam/io/gcp/bigtableio_it_test.py
 ##
 @@ -0,0 +1,189 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+""" Integration test for GCP Bigtable testing."""
+from __future__ import absolute_import
+
+import argparse
+import datetime
+import logging
+import random
+import string
+import time
+import unittest
+from nose.plugins.attrib import attr
+
+import apache_beam as beam
+from apache_beam.metrics.metric import MetricsFilter
+from apache_beam.options.pipeline_options import PipelineOptions
+from apache_beam.runners.runner import PipelineState
+from apache_beam.testing.util import assert_that, equal_to
+from apache_beam.transforms.combiners import Count
+
+try:
+  from google.cloud.bigtable import enums, row, column_family, Client
+except ImportError:
+  Client = None
+
+import bigtableio
+
+class GenerateTestRows(beam.PTransform):
+  """ A PTransform to generate dummy rows to write to a Bigtable Table.
+
+  A PTransform that generates a list of `DirectRow` and writes it to a Bigtable Table.
+  """
+  def __init__(self):
+    super(self.__class__, self).__init__()
+    self.beam_options = {'project_id': PROJECT_ID,
+                         'instance_id': INSTANCE_ID,
+                         'table_id': TABLE_ID}
+
+  def _generate(self):
+    for i in range(ROW_COUNT):
+      key = "key_%s" % ('{0:012}'.format(i))
+      test_row = row.DirectRow(row_key=key)
+      value = ''.join(random.choice(LETTERS_AND_DIGITS) for _ in range(CELL_SIZE))
+      for j in range(COLUMN_COUNT):
+        test_row.set_cell(column_family_id=COLUMN_FAMILY_ID,
+                          column=('field%s' % j).encode('utf-8'),
+                          value=value,
+                          timestamp=datetime.datetime.now())
+      yield test_row
+
+  def expand(self, pvalue):
+    return (pvalue
+            | beam.Create(self._generate())
+            | bigtableio.WriteToBigTable(project_id=self.beam_options['project_id'],
+                                         instance_id=self.beam_options['instance_id'],
+                                         table_id=self.beam_options['table_id']))
+
+@unittest.skipIf(Client is None, 'GCP Bigtable dependencies are not installed')
+class BigtableIOTest(unittest.TestCase):
+  """ Bigtable IO Connector Test
+
+  This tests the connector both ways, first writing rows to a new table, then reading them and comparing the counters
+  """
+  def setUp(self):
+    self.result = None
+    self.table = Client(project=PROJECT_ID, admin=True)\
+        .instance(instance_id=INSTANCE_ID)\
+        .table(TABLE_ID)
+
+    if not self.table.exists():
+      column_families = {COLUMN_FAMILY_ID: column_family.MaxVersionsGCRule(2)}
+      self.table.create(column_families=column_families)
+      logging.info('Table {} has been created!'.format(TABLE_ID))
+
+  @attr('IT')
+  def test_bigtable_io(self):
+    print 'Project ID: ', PROJECT_ID
+    print 'Instance ID:', INSTANCE_ID
+    print 'Table ID:   ', TABLE_ID
 
 Review comment:
   Maybe avoid printing here, or use `logging.info` instead. If you do want to 
use print, use `print(...)` as a function (this is necessary for Py3 
compatibility).
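
For illustration only (PROJECT_ID here is a hypothetical stand-in, not the PR's constant), the logging form and the Py3-compatible print form would look like:

```python
from __future__ import print_function  # needed on Python 2 for the function form

import logging

PROJECT_ID = 'my-project'  # hypothetical value, just for this sketch

# Preferred: log instead of printing; %-style lazy formatting defers
# building the message until the log level is known to be enabled.
logging.info('Project ID: %s', PROJECT_ID)

# If printing is really wanted, the function form works on both Python 2
# (with the __future__ import above) and Python 3.
print('Project ID:', PROJECT_ID)
```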
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 295111)
Time 

[jira] [Work logged] (BEAM-3342) Create a Cloud Bigtable IO connector for Python

2019-08-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-3342?focusedWorklogId=295108=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-295108
 ]

ASF GitHub Bot logged work on BEAM-3342:


Author: ASF GitHub Bot
Created on: 15/Aug/19 00:55
Start Date: 15/Aug/19 00:55
Worklog Time Spent: 10m 
  Work Description: pabloem commented on pull request #8457: [BEAM-3342] 
Create a Cloud Bigtable IO connector for Python
URL: https://github.com/apache/beam/pull/8457#discussion_r314140722
 
 

 ##
 File path: sdks/python/apache_beam/io/gcp/bigtableio_it_test.py
 ##
 @@ -0,0 +1,189 @@
+#
 
 Review comment:
   There are lint errors. See them here: 
https://scans.gradle.com/s/fcjggzcdv6klo/console-log?task=:sdks:python:test-suites:tox:py2:lintPy27#L20
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 295108)
Time Spent: 35h 20m  (was: 35h 10m)

> Create a Cloud Bigtable IO connector for Python
> ---
>
> Key: BEAM-3342
> URL: https://issues.apache.org/jira/browse/BEAM-3342
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Solomon Duskis
>Assignee: Solomon Duskis
>Priority: Major
>  Time Spent: 35h 20m
>  Remaining Estimate: 0h
>
> I would like to create a Cloud Bigtable python connector.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-3342) Create a Cloud Bigtable IO connector for Python

2019-08-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-3342?focusedWorklogId=295109=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-295109
 ]

ASF GitHub Bot logged work on BEAM-3342:


Author: ASF GitHub Bot
Created on: 15/Aug/19 00:55
Start Date: 15/Aug/19 00:55
Worklog Time Spent: 10m 
  Work Description: pabloem commented on pull request #8457: [BEAM-3342] 
Create a Cloud Bigtable IO connector for Python
URL: https://github.com/apache/beam/pull/8457#discussion_r314141487
 
 

 ##
 File path: sdks/python/apache_beam/io/gcp/bigtableio_it_test.py
 ##
 @@ -0,0 +1,189 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+""" Integration test for GCP Bigtable testing."""
+from __future__ import absolute_import
+
+import argparse
+import datetime
+import logging
+import random
+import string
+import time
+import unittest
+from nose.plugins.attrib import attr
+
+import apache_beam as beam
+from apache_beam.metrics.metric import MetricsFilter
+from apache_beam.options.pipeline_options import PipelineOptions
+from apache_beam.runners.runner import PipelineState
+from apache_beam.testing.util import assert_that, equal_to
+from apache_beam.transforms.combiners import Count
+
+try:
+  from google.cloud.bigtable import enums, row, column_family, Client
+except ImportError:
+  Client = None
+
+import bigtableio
+
+class GenerateTestRows(beam.PTransform):
+  """ A PTransform to generate dummy rows to write to a Bigtable Table.
+
+  A PTransform that generates a list of `DirectRow` and writes it to a Bigtable Table.
+  """
+  def __init__(self):
+    super(self.__class__, self).__init__()
+    self.beam_options = {'project_id': PROJECT_ID,
+                         'instance_id': INSTANCE_ID,
+                         'table_id': TABLE_ID}
+
+  def _generate(self):
+    for i in range(ROW_COUNT):
+      key = "key_%s" % ('{0:012}'.format(i))
+      test_row = row.DirectRow(row_key=key)
+      value = ''.join(random.choice(LETTERS_AND_DIGITS) for _ in range(CELL_SIZE))
+      for j in range(COLUMN_COUNT):
+        test_row.set_cell(column_family_id=COLUMN_FAMILY_ID,
+                          column=('field%s' % j).encode('utf-8'),
+                          value=value,
+                          timestamp=datetime.datetime.now())
+      yield test_row
+
+  def expand(self, pvalue):
+    return (pvalue
+            | beam.Create(self._generate())
+            | bigtableio.WriteToBigTable(project_id=self.beam_options['project_id'],
+                                         instance_id=self.beam_options['instance_id'],
+                                         table_id=self.beam_options['table_id']))
+
+@unittest.skipIf(Client is None, 'GCP Bigtable dependencies are not installed')
+class BigtableIOTest(unittest.TestCase):
+  """ Bigtable IO Connector Test
+
+  This tests the connector both ways, first writing rows to a new table, then reading them and comparing the counters
+  """
+  def setUp(self):
+    self.result = None
+    self.table = Client(project=PROJECT_ID, admin=True)\
+        .instance(instance_id=INSTANCE_ID)\
+        .table(TABLE_ID)
+
+    if not self.table.exists():
+      column_families = {COLUMN_FAMILY_ID: column_family.MaxVersionsGCRule(2)}
+      self.table.create(column_families=column_families)
+      logging.info('Table {} has been created!'.format(TABLE_ID))
+
+  @attr('IT')
+  def test_bigtable_io(self):
+    print 'Project ID: ', PROJECT_ID
+    print 'Instance ID:', INSTANCE_ID
+    print 'Table ID:   ', TABLE_ID
+
+    pipeline_options = PipelineOptions(pipeline_parameters(job_name=make_job_name()))
+    p = beam.Pipeline(options=pipeline_options)
+    _ = (p | 'Write Test Rows' >> GenerateTestRows())
+
+    self.result = p.run()
+    self.result.wait_until_finish()
+
+    assert self.result.state == PipelineState.DONE
+
+    if not hasattr(self.result, 'has_job') or self.result.has_job:
+      query_result = self.result.metrics().query(MetricsFilter().with_name('Written Row'))
+      if query_result['counters']:
+        read_counter = query_result['counters'][0]
+        logging.info('Number 

[jira] [Work logged] (BEAM-3342) Create a Cloud Bigtable IO connector for Python

2019-08-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-3342?focusedWorklogId=295110=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-295110
 ]

ASF GitHub Bot logged work on BEAM-3342:


Author: ASF GitHub Bot
Created on: 15/Aug/19 00:55
Start Date: 15/Aug/19 00:55
Worklog Time Spent: 10m 
  Work Description: pabloem commented on pull request #8457: [BEAM-3342] 
Create a Cloud Bigtable IO connector for Python
URL: https://github.com/apache/beam/pull/8457#discussion_r314140638
 
 

 ##
 File path: sdks/python/apache_beam/io/gcp/bigtableio_it_test.py
 ##
 @@ -0,0 +1,189 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+""" Integration test for GCP Bigtable testing."""
+from __future__ import absolute_import
+
+import argparse
+import datetime
+import logging
+import random
+import string
+import time
+import unittest
+from nose.plugins.attrib import attr
+
+import apache_beam as beam
+from apache_beam.metrics.metric import MetricsFilter
+from apache_beam.options.pipeline_options import PipelineOptions
+from apache_beam.runners.runner import PipelineState
+from apache_beam.testing.util import assert_that, equal_to
+from apache_beam.transforms.combiners import Count
+
+try:
+  from google.cloud.bigtable import enums, row, column_family, Client
+except ImportError:
+  Client = None
+
+import bigtableio
 
 Review comment:
   Please avoid relative imports. Import via apache_beam.io.gcp.bigtableio
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 295110)
Time Spent: 35h 40m  (was: 35.5h)

> Create a Cloud Bigtable IO connector for Python
> ---
>
> Key: BEAM-3342
> URL: https://issues.apache.org/jira/browse/BEAM-3342
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Solomon Duskis
>Assignee: Solomon Duskis
>Priority: Major
>  Time Spent: 35h 40m
>  Remaining Estimate: 0h
>
> I would like to create a Cloud Bigtable python connector.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-3342) Create a Cloud Bigtable IO connector for Python

2019-08-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-3342?focusedWorklogId=295107=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-295107
 ]

ASF GitHub Bot logged work on BEAM-3342:


Author: ASF GitHub Bot
Created on: 15/Aug/19 00:55
Start Date: 15/Aug/19 00:55
Worklog Time Spent: 10m 
  Work Description: pabloem commented on pull request #8457: [BEAM-3342] 
Create a Cloud Bigtable IO connector for Python
URL: https://github.com/apache/beam/pull/8457#discussion_r314140821
 
 

 ##
 File path: sdks/python/apache_beam/io/gcp/bigtableio_it_test.py
 ##
 @@ -0,0 +1,189 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+""" Integration test for GCP Bigtable testing."""
+from __future__ import absolute_import
+
+import argparse
+import datetime
+import logging
+import random
+import string
+import time
+import unittest
+from nose.plugins.attrib import attr
+
+import apache_beam as beam
+from apache_beam.metrics.metric import MetricsFilter
+from apache_beam.options.pipeline_options import PipelineOptions
+from apache_beam.runners.runner import PipelineState
+from apache_beam.testing.util import assert_that, equal_to
+from apache_beam.transforms.combiners import Count
+
+try:
+  from google.cloud.bigtable import enums, row, column_family, Client
+except ImportError:
+  Client = None
+
+import bigtableio
+
+class GenerateTestRows(beam.PTransform):
+  """ A PTransform to generate dummy rows to write to a Bigtable Table.
+
+  A PTransform that generates a list of `DirectRow` and writes it to a Bigtable Table.
+  """
+  def __init__(self):
+    super(self.__class__, self).__init__()
+    self.beam_options = {'project_id': PROJECT_ID,
+                         'instance_id': INSTANCE_ID,
+                         'table_id': TABLE_ID}
+
+  def _generate(self):
+    for i in range(ROW_COUNT):
+      key = "key_%s" % ('{0:012}'.format(i))
+      test_row = row.DirectRow(row_key=key)
+      value = ''.join(random.choice(LETTERS_AND_DIGITS) for _ in range(CELL_SIZE))
+      for j in range(COLUMN_COUNT):
+        test_row.set_cell(column_family_id=COLUMN_FAMILY_ID,
+                          column=('field%s' % j).encode('utf-8'),
+                          value=value,
+                          timestamp=datetime.datetime.now())
+      yield test_row
+
+  def expand(self, pvalue):
+    return (pvalue
+            | beam.Create(self._generate())
+            | bigtableio.WriteToBigTable(project_id=self.beam_options['project_id'],
+                                         instance_id=self.beam_options['instance_id'],
+                                         table_id=self.beam_options['table_id']))
+
+@unittest.skipIf(Client is None, 'GCP Bigtable dependencies are not installed')
+class BigtableIOTest(unittest.TestCase):
+  """ Bigtable IO Connector Test
+
+  This tests the connector both ways, first writing rows to a new table, then reading them and comparing the counters
+  """
+  def setUp(self):
+    self.result = None
+    self.table = Client(project=PROJECT_ID, admin=True)\
+        .instance(instance_id=INSTANCE_ID)\
+        .table(TABLE_ID)
+
+    if not self.table.exists():
+      column_families = {COLUMN_FAMILY_ID: column_family.MaxVersionsGCRule(2)}
+      self.table.create(column_families=column_families)
+      logging.info('Table {} has been created!'.format(TABLE_ID))
+
+  @attr('IT')
+  def test_bigtable_io(self):
+    print 'Project ID: ', PROJECT_ID
+    print 'Instance ID:', INSTANCE_ID
+    print 'Table ID:   ', TABLE_ID
 
 Review comment:
   Maybe avoid printing here, or use `logging.info` instead. If you do want to 
use print, use `print(...)` as a function
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 295107)

> Create a Cloud Bigtable IO connector for Python
> 

[jira] [Work logged] (BEAM-5980) Add load tests for Core Apache Beam operations

2019-08-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-5980?focusedWorklogId=295104=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-295104
 ]

ASF GitHub Bot logged work on BEAM-5980:


Author: ASF GitHub Bot
Created on: 15/Aug/19 00:43
Start Date: 15/Aug/19 00:43
Worklog Time Spent: 10m 
  Work Description: pabloem commented on issue #9286: [BEAM-5980] Remove 
redundant combine tests
URL: https://github.com/apache/beam/pull/9286#issuecomment-521471230
 
 
   LGTM. Feel free to self-merge if necessary.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 295104)
Time Spent: 1h 50m  (was: 1h 40m)

> Add load tests for Core Apache Beam operations 
> ---
>
> Key: BEAM-5980
> URL: https://issues.apache.org/jira/browse/BEAM-5980
> Project: Beam
>  Issue Type: New Feature
>  Components: testing
>Reporter: Lukasz Gajowy
>Assignee: Lukasz Gajowy
>Priority: Major
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> This involves adding a suite of load tests described in this proposal: 
> [https://s.apache.org/load-test-basic-operations]



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-08-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=295097=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-295097
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 15/Aug/19 00:32
Start Date: 15/Aug/19 00:32
Worklog Time Spent: 10m 
  Work Description: robinyqiu commented on issue #9144: [BEAM-7013] 
Integrating ZetaSketch's HLL++ algorithm with Beam
URL: https://github.com/apache/beam/pull/9144#issuecomment-521469020
 
 
   Run Java PostCommit
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 295097)
Time Spent: 16h 40m  (was: 16.5h)

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
>  Issue Type: New Feature
>  Components: extensions-java-sketching, sdk-java-core
>Reporter: Yueyang Qiu
>Assignee: Yueyang Qiu
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 16h 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7909) Write integration tests to test customized containers

2019-08-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7909?focusedWorklogId=295094=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-295094
 ]

ASF GitHub Bot logged work on BEAM-7909:


Author: ASF GitHub Bot
Created on: 15/Aug/19 00:16
Start Date: 15/Aug/19 00:16
Worklog Time Spent: 10m 
  Work Description: Hannah-Jiang commented on issue #9335: [WIP][BEAM-7909] 
customized containers for python
URL: https://github.com/apache/beam/pull/9335#issuecomment-521465832
 
 
   Run Portable_Python PreCommit
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 295094)
Time Spent: 20m  (was: 10m)

> Write integration tests to test customized containers
> -
>
> Key: BEAM-7909
> URL: https://issues.apache.org/jira/browse/BEAM-7909
> Project: Beam
>  Issue Type: Sub-task
>  Components: build-system
>Reporter: Hannah Jiang
>Assignee: Hannah Jiang
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (BEAM-7981) ParDo function wrapper doesn't support Iterable output types

2019-08-14 Thread Ahmet Altay (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-7981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16907699#comment-16907699
 ] 

Ahmet Altay commented on BEAM-7981:
---

Just for clarification, [~udim] are you planning to work on this?

> ParDo function wrapper doesn't support Iterable output types
> 
>
> Key: BEAM-7981
> URL: https://issues.apache.org/jira/browse/BEAM-7981
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Udi Meiri
>Priority: Major
>
> I believe the bug is in CallableWrapperDoFn.default_type_hints, which 
> converts Iterable[str] to str.
> This test will be included (commented out) in 
> https://github.com/apache/beam/pull/9283
> {code}
>   def test_typed_callable_iterable_output(self):
>     @typehints.with_input_types(int)
>     @typehints.with_output_types(typehints.Iterable[str])
>     def do_fn(element):
>       return [[str(element)] * 2]
>     result = [1, 2] | beam.ParDo(do_fn)
>     self.assertEqual([['1', '1'], ['2', '2']], sorted(result))
> {code}
> Result:
> {code}
> ==
> ERROR: test_typed_callable_iterable_output 
> (apache_beam.typehints.typed_pipeline_test.MainInputTest)
> --
> Traceback (most recent call last):
>   File 
> "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/typehints/typed_pipeline_test.py",
>  line 104, in test_typed_callable_iterable_output
> result = [1, 2] | beam.ParDo(do_fn)
>   File 
> "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/transforms/ptransform.py",
>  line 519, in __ror__
> p.run().wait_until_finish()
>   File 
> "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/pipeline.py", 
> line 406, in run
> self._options).run(False)
>   File 
> "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/pipeline.py", 
> line 419, in run
> return self.runner.run_pipeline(self, self._options)
>   File 
> "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/direct/direct_runner.py",
>  line 129, in run_pipeline
> return runner.run_pipeline(pipeline, options)
>   File 
> "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/portability/fn_api_runner.py",
>  line 366, in run_pipeline
> default_environment=self._default_environment))
>   File 
> "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/portability/fn_api_runner.py",
>  line 373, in run_via_runner_api
> return self.run_stages(stage_context, stages)
>   File 
> "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/portability/fn_api_runner.py",
>  line 455, in run_stages
> stage_context.safe_coders)
>   File 
> "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/portability/fn_api_runner.py",
>  line 733, in _run_stage
> result, splits = bundle_manager.process_bundle(data_input, data_output)
>   File 
> "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/portability/fn_api_runner.py",
>  line 1663, in process_bundle
> part, expected_outputs), part_inputs):
>   File "/usr/lib/python3.7/concurrent/futures/_base.py", line 586, in 
> result_iterator
> yield fs.pop().result()
>   File "/usr/lib/python3.7/concurrent/futures/_base.py", line 432, in result
> return self.__get_result()
>   File "/usr/lib/python3.7/concurrent/futures/_base.py", line 384, in 
> __get_result
> raise self._exception
>   File "/usr/lib/python3.7/concurrent/futures/thread.py", line 57, in run
> result = self.fn(*self.args, **self.kwargs)
>   File 
> "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/portability/fn_api_runner.py",
>  line 1663, in 
> part, expected_outputs), part_inputs):
>   File 
> "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/portability/fn_api_runner.py",
>  line 1601, in process_bundle
> result_future = self._worker_handler.control_conn.push(process_bundle_req)
>   File 
> "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/portability/fn_api_runner.py",
>  line 1080, in push
> response = self.worker.do_instruction(request)
>   File 
> "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/worker/sdk_worker.py",
>  line 343, in do_instruction
> request.instruction_id)
>   File 
> "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/worker/sdk_worker.py",
>  line 369, in process_bundle
> bundle_processor.process_bundle(instruction_id))
>   File 
> "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/worker/bundle_processor.py",
>  line 593, in process_bundle
> data.ptransform_id].process_encoded(data.data)
>   

[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-08-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=295089=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-295089
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 14/Aug/19 23:52
Start Date: 14/Aug/19 23:52
Worklog Time Spent: 10m 
  Work Description: robinyqiu commented on issue #9144: [BEAM-7013] 
Integrating ZetaSketch's HLL++ algorithm with Beam
URL: https://github.com/apache/beam/pull/9144#issuecomment-521460107
 
 
   Run Java PostCommit
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 295089)
Time Spent: 16.5h  (was: 16h 20m)

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
>  Issue Type: New Feature
>  Components: extensions-java-sketching, sdk-java-core
>Reporter: Yueyang Qiu
>Assignee: Yueyang Qiu
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 16.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-08-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=295087=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-295087
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 14/Aug/19 23:46
Start Date: 14/Aug/19 23:46
Worklog Time Spent: 10m 
  Work Description: robinyqiu commented on issue #9144: [BEAM-7013] 
Integrating ZetaSketch's HLL++ algorithm with Beam
URL: https://github.com/apache/beam/pull/9144#issuecomment-521460107
 
 
   Run Java PostCommit
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 295087)
Time Spent: 16h 20m  (was: 16h 10m)

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
>  Issue Type: New Feature
>  Components: extensions-java-sketching, sdk-java-core
>Reporter: Yueyang Qiu
>Assignee: Yueyang Qiu
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 16h 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-08-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=295086=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-295086
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 14/Aug/19 23:46
Start Date: 14/Aug/19 23:46
Worklog Time Spent: 10m 
  Work Description: robinyqiu commented on issue #9144: [BEAM-7013] 
Integrating ZetaSketch's HLL++ algorithm with Beam
URL: https://github.com/apache/beam/pull/9144#issuecomment-521459904
 
 
   Run Java PostCommit
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 295086)
Time Spent: 16h 10m  (was: 16h)

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
>  Issue Type: New Feature
>  Components: extensions-java-sketching, sdk-java-core
>Reporter: Yueyang Qiu
>Assignee: Yueyang Qiu
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 16h 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7013) A new count distinct transform based on BigQuery compatible HyperLogLog++ implementation

2019-08-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7013?focusedWorklogId=295084=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-295084
 ]

ASF GitHub Bot logged work on BEAM-7013:


Author: ASF GitHub Bot
Created on: 14/Aug/19 23:45
Start Date: 14/Aug/19 23:45
Worklog Time Spent: 10m 
  Work Description: robinyqiu commented on issue #9144: [BEAM-7013] 
Integrating ZetaSketch's HLL++ algorithm with Beam
URL: https://github.com/apache/beam/pull/9144#issuecomment-521459904
 
 
   Run Java PostCommit
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 295084)
Time Spent: 16h  (was: 15h 50m)

> A new count distinct transform based on BigQuery compatible HyperLogLog++ 
> implementation
> 
>
> Key: BEAM-7013
> URL: https://issues.apache.org/jira/browse/BEAM-7013
> Project: Beam
>  Issue Type: New Feature
>  Components: extensions-java-sketching, sdk-java-core
>Reporter: Yueyang Qiu
>Assignee: Yueyang Qiu
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 16h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (BEAM-7983) Template parameters don't work if they are only used in DoFns

2019-08-14 Thread Yunqing Zhou (JIRA)
Yunqing Zhou created BEAM-7983:
--

 Summary: Template parameters don't work if they are only used in 
DoFns
 Key: BEAM-7983
 URL: https://issues.apache.org/jira/browse/BEAM-7983
 Project: Beam
  Issue Type: Bug
  Components: sdk-java-core
Reporter: Yunqing Zhou
Assignee: Luke Cwik


Template parameters don't work if they are only used in DoFns but not anywhere 
else in main.

Sample pipeline:

 
{code:java}
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class BugPipeline {
  public interface Options extends PipelineOptions {
    ValueProvider<String> getFoo();
    void setFoo(ValueProvider<String> foo);
  }

  public static void main(String[] args) throws Exception {
    Options options = PipelineOptionsFactory.fromArgs(args).as(Options.class);
    Pipeline p = Pipeline.create(options);
    p.apply(Create.of(1)).apply(ParDo.of(new DoFn<Integer, Void>() {
      @ProcessElement
      public void processElement(ProcessContext context) {
        System.out.println(context.getPipelineOptions().as(Options.class).getFoo());
      }
    }));
    p.run();
  }
}

{code}

Option "foo" is not used anywhere other than in the DoFn. So to reproduce the 
problem:
{code:bash}
$ java BugPipeline --project=$PROJECT --stagingLocation=$STAGING \
    --templateLocation=$TEMPLATE --runner=DataflowRunner
$ gcloud dataflow jobs run $NAME --gcs-location=$TEMPLATE --parameters=foo=bar
{code}

It will fail with this error:
{code}
ERROR: (gcloud.dataflow.jobs.run) INVALID_ARGUMENT: (2621bec26c2488b7): The 
workflow could not be created. Causes: (2621bec26c248dba): Found unexpected 
parameters: ['foo' (perhaps you meant 'zone')]
- '@type': type.googleapis.com/google.rpc.DebugInfo
  detail: "(2621bec26c2488b7): The workflow could not be created. Causes: 
(2621bec26c248dba):\
\ Found unexpected parameters: ['foo' (perhaps you meant 'zone')]"
{code}

The underlying problem is that ProxyInvocationHandler.java only populates 
options that are "invoked" into the pipeline option map in the job object:
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/options/ProxyInvocationHandler.java#L159

One way to solve this is to save all ValueProvider-typed parameters in the 
PipelineOptions section. Alternatively, a registration mechanism could be 
introduced.

A current workaround is to annotate the parameter with 
{code}@Validation.Required{code}, which will call invoke() behind the scenes.
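The "only invoked options are populated" behavior described above can be illustrated with a minimal Python sketch. All names here are hypothetical; Beam's actual mechanism is the Java ProxyInvocationHandler, not this code.

```python
# Hypothetical sketch of an options proxy that records which getters were
# actually invoked, mirroring why a ValueProvider parameter used only inside
# a DoFn never makes it into the serialized template.

class RecordingOptions:
    """Wraps an options dict and records which keys were read."""

    def __init__(self, values):
        self._values = values
        self.invoked = set()

    def get(self, name):
        self.invoked.add(name)
        return self._values.get(name)

    def serialized(self):
        # Only invoked options end up in the "job object", which is the
        # behavior this ticket describes.
        return {k: self._values[k] for k in self.invoked}

opts = RecordingOptions({"foo": "bar", "zone": "us-central1-a"})
opts.get("zone")          # "zone" is read in main(); "foo" only inside a DoFn
print(opts.serialized())  # {'zone': 'us-central1-a'} -- 'foo' is missing
```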





--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (BEAM-7982) Dataflow runner needs to identify the new format of metric names for distribution metrics

2019-08-14 Thread David Yan (JIRA)
David Yan created BEAM-7982:
---

 Summary: Dataflow runner needs to identify the new format of 
metric names for distribution metrics
 Key: BEAM-7982
 URL: https://issues.apache.org/jira/browse/BEAM-7982
 Project: Beam
  Issue Type: Improvement
  Components: runner-dataflow
Reporter: David Yan


For example, 
[https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/dataflow/dataflow_metrics.py#L157]

uses [MAX], [MIN], etc., but the new format will be _MAX, _MIN, etc.
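A hypothetical translation between the two suffix formats could look like the following; the exact suffix set is an assumption here, not Dataflow's actual metric naming contract.

```python
import re

# Illustrative sketch only: translate a distribution metric name from the
# legacy bracketed suffix (e.g. "latency[MAX]") to the underscore form
# ("latency_MAX"). Already-converted names pass through unchanged.

def normalize_suffix(name: str) -> str:
    return re.sub(r'\[(MAX|MIN|MEAN|COUNT)\]$', r'_\1', name)

print(normalize_suffix("latency[MAX]"))  # latency_MAX
print(normalize_suffix("latency_MIN"))   # latency_MIN (unchanged)
```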



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (BEAM-7049) Merge multiple input to one BeamUnionRel

2019-08-14 Thread sridhar Reddy (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-7049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907689#comment-16907689
 ] 

sridhar Reddy commented on BEAM-7049:
-

Good idea! I will create one.

> Merge multiple input to one BeamUnionRel
> 
>
> Key: BEAM-7049
> URL: https://issues.apache.org/jira/browse/BEAM-7049
> Project: Beam
>  Issue Type: Improvement
>  Components: dsl-sql
>Reporter: Rui Wang
>Assignee: sridhar Reddy
>Priority: Major
>
> BeamUnionRel assumes inputs are two and rejects more. So `a UNION b UNION c` 
> will have to be created as UNION(a, UNION(b, c)) and have two shuffles. If 
> BeamUnionRel can handle multiple inputs, we will have only one shuffle
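The proposed n-ary handling amounts to flattening nested binary unions into a single union node. A sketch under an assumed tuple-based node representation (not Beam SQL's actual BeamUnionRel classes):

```python
# Collapse nested binary UNION nodes into one n-ary union so that
# `a UNION b UNION c` needs a single shuffle instead of two.
# Nodes are either a leaf (str) or a tuple ('UNION', child, child, ...).

def flatten_union(node):
    if isinstance(node, tuple) and node[0] == 'UNION':
        inputs = []
        for child in node[1:]:
            flat = flatten_union(child)
            if isinstance(flat, tuple) and flat[0] == 'UNION':
                # Splice a nested union's inputs directly into this node.
                inputs.extend(flat[1:])
            else:
                inputs.append(flat)
        return ('UNION', *inputs)
    return node

tree = ('UNION', 'a', ('UNION', 'b', 'c'))  # how a UNION b UNION c parses today
print(flatten_union(tree))                  # ('UNION', 'a', 'b', 'c')
```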



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (BEAM-7981) ParDo function wrapper doesn't support Iterable output types

2019-08-14 Thread Udi Meiri (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udi Meiri updated BEAM-7981:

Summary: ParDo function wrapper doesn't support Iterable output types  
(was: CallableWrapperDoFn)

> ParDo function wrapper doesn't support Iterable output types
> 
>
> Key: BEAM-7981
> URL: https://issues.apache.org/jira/browse/BEAM-7981
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Udi Meiri
>Priority: Major
>
> I believe the bug is in CallableWrapperDoFn.default_type_hints, which 
> converts Iterable[str] to str.
> This test will be included (commented out) in 
> https://github.com/apache/beam/pull/9283
> {code}
>   def test_typed_callable_iterable_output(self):
> @typehints.with_input_types(int)
> @typehints.with_output_types(typehints.Iterable[str])
> def do_fn(element):
>   return [[str(element)] * 2]
> result = [1, 2] | beam.ParDo(do_fn)
> self.assertEqual([['1', '1'], ['2', '2']], sorted(result))
> {code}
> Result:
> {code}
> ==
> ERROR: test_typed_callable_iterable_output 
> (apache_beam.typehints.typed_pipeline_test.MainInputTest)
> --
> Traceback (most recent call last):
>   File 
> "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/typehints/typed_pipeline_test.py",
>  line 104, in test_typed_callable_iterable_output
> result = [1, 2] | beam.ParDo(do_fn)
>   File 
> "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/transforms/ptransform.py",
>  line 519, in __ror__
> p.run().wait_until_finish()
>   File 
> "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/pipeline.py", 
> line 406, in run
> self._options).run(False)
>   File 
> "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/pipeline.py", 
> line 419, in run
> return self.runner.run_pipeline(self, self._options)
>   File 
> "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/direct/direct_runner.py",
>  line 129, in run_pipeline
> return runner.run_pipeline(pipeline, options)
>   File 
> "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/portability/fn_api_runner.py",
>  line 366, in run_pipeline
> default_environment=self._default_environment))
>   File 
> "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/portability/fn_api_runner.py",
>  line 373, in run_via_runner_api
> return self.run_stages(stage_context, stages)
>   File 
> "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/portability/fn_api_runner.py",
>  line 455, in run_stages
> stage_context.safe_coders)
>   File 
> "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/portability/fn_api_runner.py",
>  line 733, in _run_stage
> result, splits = bundle_manager.process_bundle(data_input, data_output)
>   File 
> "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/portability/fn_api_runner.py",
>  line 1663, in process_bundle
> part, expected_outputs), part_inputs):
>   File "/usr/lib/python3.7/concurrent/futures/_base.py", line 586, in 
> result_iterator
> yield fs.pop().result()
>   File "/usr/lib/python3.7/concurrent/futures/_base.py", line 432, in result
> return self.__get_result()
>   File "/usr/lib/python3.7/concurrent/futures/_base.py", line 384, in 
> __get_result
> raise self._exception
>   File "/usr/lib/python3.7/concurrent/futures/thread.py", line 57, in run
> result = self.fn(*self.args, **self.kwargs)
>   File 
> "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/portability/fn_api_runner.py",
>  line 1663, in <lambda>
> part, expected_outputs), part_inputs):
>   File 
> "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/portability/fn_api_runner.py",
>  line 1601, in process_bundle
> result_future = self._worker_handler.control_conn.push(process_bundle_req)
>   File 
> "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/portability/fn_api_runner.py",
>  line 1080, in push
> response = self.worker.do_instruction(request)
>   File 
> "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/worker/sdk_worker.py",
>  line 343, in do_instruction
> request.instruction_id)
>   File 
> "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/worker/sdk_worker.py",
>  line 369, in process_bundle
> bundle_processor.process_bundle(instruction_id))
>   File 
> "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/worker/bundle_processor.py",
>  line 593, in process_bundle
> data.ptransform_id].process_encoded(data.data)
>   File 
> 

[jira] [Created] (BEAM-7981) CallableWrapperDoFn

2019-08-14 Thread Udi Meiri (JIRA)
Udi Meiri created BEAM-7981:
---

 Summary: CallableWrapperDoFn
 Key: BEAM-7981
 URL: https://issues.apache.org/jira/browse/BEAM-7981
 Project: Beam
  Issue Type: Bug
  Components: sdk-py-core
Reporter: Udi Meiri


I believe the bug is in CallableWrapperDoFn.default_type_hints, which converts 
Iterable[str] to str.

This test will be included (commented out) in 
https://github.com/apache/beam/pull/9283

{code}
  def test_typed_callable_iterable_output(self):
@typehints.with_input_types(int)
@typehints.with_output_types(typehints.Iterable[str])
def do_fn(element):
  return [[str(element)] * 2]

result = [1, 2] | beam.ParDo(do_fn)
self.assertEqual([['1', '1'], ['2', '2']], sorted(result))
{code}

Result:
{code}
==
ERROR: test_typed_callable_iterable_output 
(apache_beam.typehints.typed_pipeline_test.MainInputTest)
--
Traceback (most recent call last):
  File 
"/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/typehints/typed_pipeline_test.py",
 line 104, in test_typed_callable_iterable_output
result = [1, 2] | beam.ParDo(do_fn)
  File 
"/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/transforms/ptransform.py",
 line 519, in __ror__
p.run().wait_until_finish()
  File 
"/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/pipeline.py", 
line 406, in run
self._options).run(False)
  File 
"/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/pipeline.py", 
line 419, in run
return self.runner.run_pipeline(self, self._options)
  File 
"/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/direct/direct_runner.py",
 line 129, in run_pipeline
return runner.run_pipeline(pipeline, options)
  File 
"/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/portability/fn_api_runner.py",
 line 366, in run_pipeline
default_environment=self._default_environment))
  File 
"/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/portability/fn_api_runner.py",
 line 373, in run_via_runner_api
return self.run_stages(stage_context, stages)
  File 
"/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/portability/fn_api_runner.py",
 line 455, in run_stages
stage_context.safe_coders)
  File 
"/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/portability/fn_api_runner.py",
 line 733, in _run_stage
result, splits = bundle_manager.process_bundle(data_input, data_output)
  File 
"/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/portability/fn_api_runner.py",
 line 1663, in process_bundle
part, expected_outputs), part_inputs):
  File "/usr/lib/python3.7/concurrent/futures/_base.py", line 586, in 
result_iterator
yield fs.pop().result()
  File "/usr/lib/python3.7/concurrent/futures/_base.py", line 432, in result
return self.__get_result()
  File "/usr/lib/python3.7/concurrent/futures/_base.py", line 384, in 
__get_result
raise self._exception
  File "/usr/lib/python3.7/concurrent/futures/thread.py", line 57, in run
result = self.fn(*self.args, **self.kwargs)
  File 
"/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/portability/fn_api_runner.py",
 line 1663, in <lambda>
part, expected_outputs), part_inputs):
  File 
"/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/portability/fn_api_runner.py",
 line 1601, in process_bundle
result_future = self._worker_handler.control_conn.push(process_bundle_req)
  File 
"/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/portability/fn_api_runner.py",
 line 1080, in push
response = self.worker.do_instruction(request)
  File 
"/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/worker/sdk_worker.py",
 line 343, in do_instruction
request.instruction_id)
  File 
"/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/worker/sdk_worker.py",
 line 369, in process_bundle
bundle_processor.process_bundle(instruction_id))
  File 
"/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/worker/bundle_processor.py",
 line 593, in process_bundle
data.ptransform_id].process_encoded(data.data)
  File 
"/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/worker/bundle_processor.py",
 line 143, in process_encoded
self.output(decoded_value)
  File 
"/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/worker/operations.py",
 line 256, in output
cython.cast(Receiver, self.receivers[output_index]).receive(windowed_value)
  File 
"/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/worker/operations.py",
 line 143, in receive
self.consumer.process(windowed_value)
  File 

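Outside Beam, the intended one-level unwrap of the declared output type can be sketched with the standard typing module. This assumes the bug is an extra unwrap in default_type_hints, as suggested above; the helper below is illustrative, not Beam's code.

```python
from typing import Iterable, get_args, get_type_hints

# A DoFn-style callable returns an iterable of output elements, so the
# element type hint should be obtained by unwrapping exactly ONE Iterable
# level. Unwrapping twice turns an Iterable[str] element type into str,
# which is the suspected behavior of CallableWrapperDoFn.default_type_hints.

def do_fn(element: int) -> Iterable[Iterable[str]]:
    # Each output element is itself an iterable of strings.
    return [[str(element)] * 2]

return_hint = get_type_hints(do_fn)['return']  # Iterable[Iterable[str]]
element_hint = get_args(return_hint)[0]        # Iterable[str] -- correct
too_far = get_args(element_hint)[0]            # str -- the buggy result
print(element_hint, too_far)
```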
[jira] [Work logged] (BEAM-7969) Streaming Dataflow worker doesn't report FnAPI metrics.

2019-08-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7969?focusedWorklogId=295068&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-295068
 ]

ASF GitHub Bot logged work on BEAM-7969:


Author: ASF GitHub Bot
Created on: 14/Aug/19 22:22
Start Date: 14/Aug/19 22:22
Worklog Time Spent: 10m 
  Work Description: Ardagan commented on issue #9330: [BEAM-7969] Report 
FnAPI counters as deltas in streaming jobs.
URL: https://github.com/apache/beam/pull/9330#issuecomment-521442339
 
 
   > This map uses `MetricKey` as the key, which does not contain the sdkHarness 
id and can potentially have name collisions across multiple sdkHarnesses.
   > We use multiple SdkHarnesses in the Python SDK, which will have this 
problem.
   
   Can you elaborate on the problem a bit, please? IIUC, different SDK harnesses 
will handle different steps, so metric keys will not overlap anyway.
   
   Also, I feel fixing that issue would be better done in a separate PR.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 295068)
Time Spent: 2h  (was: 1h 50m)

> Streaming Dataflow worker doesn't report FnAPI metrics.
> ---
>
> Key: BEAM-7969
> URL: https://issues.apache.org/jira/browse/BEAM-7969
> Project: Beam
>  Issue Type: Bug
>  Components: java-fn-execution, runner-dataflow
>Reporter: Mikhail Gryzykhin
>Assignee: Mikhail Gryzykhin
>Priority: Major
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> EOM



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7969) Streaming Dataflow worker doesn't report FnAPI metrics.

2019-08-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7969?focusedWorklogId=295067&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-295067
 ]

ASF GitHub Bot logged work on BEAM-7969:


Author: ASF GitHub Bot
Created on: 14/Aug/19 22:19
Start Date: 14/Aug/19 22:19
Worklog Time Spent: 10m 
  Work Description: Ardagan commented on pull request #9330: [BEAM-7969] 
Report FnAPI counters as deltas in streaming jobs.
URL: https://github.com/apache/beam/pull/9330#discussion_r314108230
 
 

 ##
 File path: 
runners/google-cloud-dataflow-java/worker/src/main/java/org/apache/beam/runners/dataflow/worker/StreamingDataflowWorker.java
 ##
 @@ -1867,6 +1873,28 @@ private void sendWorkerUpdatesToDataflowService(
   cumulativeCounters.extractUpdates(false, 
DataflowCounterUpdateExtractor.INSTANCE));
   counterUpdates.addAll(
   
deltaCounters.extractModifiedDeltaUpdates(DataflowCounterUpdateExtractor.INSTANCE));
+  if (hasExperiment(options, "beam_fn_api")) {
+while (!this.pendingMonitoringInfos.isEmpty()) {
+  final CounterUpdate item = this.pendingMonitoringInfos.poll();
 
 Review comment:
   We are good here since we first initialize the class and the metric, and only 
then start multiple threads.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 295067)
Time Spent: 1h 50m  (was: 1h 40m)

> Streaming Dataflow worker doesn't report FnAPI metrics.
> ---
>
> Key: BEAM-7969
> URL: https://issues.apache.org/jira/browse/BEAM-7969
> Project: Beam
>  Issue Type: Bug
>  Components: java-fn-execution, runner-dataflow
>Reporter: Mikhail Gryzykhin
>Assignee: Mikhail Gryzykhin
>Priority: Major
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> EOM



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7969) Streaming Dataflow worker doesn't report FnAPI metrics.

2019-08-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7969?focusedWorklogId=295066&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-295066
 ]

ASF GitHub Bot logged work on BEAM-7969:


Author: ASF GitHub Bot
Created on: 14/Aug/19 22:17
Start Date: 14/Aug/19 22:17
Worklog Time Spent: 10m 
  Work Description: Ardagan commented on pull request #9330: [BEAM-7969] 
Report FnAPI counters as deltas in streaming jobs.
URL: https://github.com/apache/beam/pull/9330#discussion_r314107619
 
 

 ##
 File path: 
runners/google-cloud-dataflow-java/worker/src/main/java/org/apache/beam/runners/dataflow/worker/StreamingDataflowWorker.java
 ##
 @@ -1867,6 +1873,28 @@ private void sendWorkerUpdatesToDataflowService(
   cumulativeCounters.extractUpdates(false, 
DataflowCounterUpdateExtractor.INSTANCE));
   counterUpdates.addAll(
   
deltaCounters.extractModifiedDeltaUpdates(DataflowCounterUpdateExtractor.INSTANCE));
+  if (hasExperiment(options, "beam_fn_api")) {
+while (!this.pendingMonitoringInfos.isEmpty()) {
+  final CounterUpdate item = this.pendingMonitoringInfos.poll();
+
+  // This change will treat counter as delta.
+  // This is required because we receive cumulative results from FnAPI 
harness,
+  // while streaming job is expected to receive delta updates to 
counters on same
+  // WorkItem.
+  if (item.getCumulative()) {
+item.setCumulative(false);
+  } else {
+// In current world all counters coming from FnAPI are cumulative.
+// This is a safety check in case new counter type appears in 
FnAPI.
+throw new UnsupportedOperationException(
 
 Review comment:
   It is better to fail fast in this case. A new type of counter appears only 
when someone modifies the SDK, so it comes from a Beam developer, not a user.
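The cumulative-to-delta conversion under review can be sketched in Python (illustrative only; the worker code above is Java, and the class and method names here are invented):

```python
# The FnAPI harness reports cumulative counter values, but a streaming
# Dataflow job expects per-report deltas on the same WorkItem, so each
# report subtracts the last cumulative value seen for that counter.

class DeltaConverter:
    def __init__(self):
        self._last = {}  # counter name -> last cumulative value reported

    def to_delta(self, name, cumulative_value):
        delta = cumulative_value - self._last.get(name, 0)
        self._last[name] = cumulative_value
        return delta

conv = DeltaConverter()
print(conv.to_delta("elements", 10))  # 10 (first report)
print(conv.to_delta("elements", 25))  # 15
print(conv.to_delta("elements", 25))  # 0  (no change since last report)
```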
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 295066)
Time Spent: 1h 40m  (was: 1.5h)

> Streaming Dataflow worker doesn't report FnAPI metrics.
> ---
>
> Key: BEAM-7969
> URL: https://issues.apache.org/jira/browse/BEAM-7969
> Project: Beam
>  Issue Type: Bug
>  Components: java-fn-execution, runner-dataflow
>Reporter: Mikhail Gryzykhin
>Assignee: Mikhail Gryzykhin
>Priority: Major
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> EOM



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Resolved] (BEAM-7866) Python MongoDB IO performance and correctness issues

2019-08-14 Thread yifan zou (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yifan zou resolved BEAM-7866.
-
Resolution: Fixed

> Python MongoDB IO performance and correctness issues
> 
>
> Key: BEAM-7866
> URL: https://issues.apache.org/jira/browse/BEAM-7866
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Eugene Kirpichov
>Assignee: Yichi Zhang
>Priority: Major
> Fix For: 2.15.0
>
>  Time Spent: 12h
>  Remaining Estimate: 0h
>
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/mongodbio.py
>  splits the query result by computing number of results in constructor, and 
> then in each reader re-executing the whole query and getting an index 
> sub-range of those results.
> This is broken in several critical ways:
> - The order of query results returned by find() is not necessarily 
> deterministic, so the idea of index ranges on it is meaningless: each shard 
> may basically get random, possibly overlapping subsets of the total results
> - Even if you add order by `_id`, the database may be changing concurrently 
> to reading and splitting. E.g. if the database contained documents with ids 
> 10 20 30 40 50, and this was split into shards 0..2 and 3..5 (under the 
> assumption that these shards would contain respectively 10 20 30, and 40 50), 
> and then suppose shard 10 20 30 is read and then document 25 is inserted - 
> then the 3..5 shard will read 30 40 50, i.e. document 30 is duplicated and 
> document 25 is lost.
> - Every shard re-executes the query and skips the first start_offset items, 
> which in total is quadratic complexity
> - The query is first executed in the constructor in order to count results, 
> which 1) means the constructor can be super slow and 2) it won't work at all 
> if the database is unavailable at the time the pipeline is constructed (e.g. 
> if this is a template).
> Unfortunately, none of these issues are caught by SourceTestUtils: this class 
> has extensive coverage with it, and the tests pass. This is because the tests 
> return the same results in the same order. I don't know how to catch this 
> automatically, and I don't know how to catch the performance issue 
> automatically, but these would all be important follow-up items after the 
> actual fix.
> CC: [~chamikara] as reviewer.
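The duplicated/lost-document scenario in the description can be reproduced with a small simulation (illustrative Python only, not the mongodbio connector's code; document ids and shard ranges follow the example above):

```python
# Offset-based shards over a query whose results change between reads:
# shard 0..2 reads first, document 25 is inserted concurrently, then
# shard 3..5 reads -- document 30 is duplicated and 25 is never read.

docs = [10, 20, 30, 40, 50]

def read_shard(collection, start, end):
    # Each shard re-executes the whole query and slices by index.
    return sorted(collection)[start:end]

first = read_shard(docs, 0, 3)   # [10, 20, 30]
docs.append(25)                  # concurrent insert after the first read
second = read_shard(docs, 3, 6)  # [30, 40, 50]

print(first, second)
```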



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (BEAM-7866) Python MongoDB IO performance and correctness issues

2019-08-14 Thread yifan zou (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-7866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907646#comment-16907646
 ] 

yifan zou commented on BEAM-7866:
-

The PR got merged. Marking this ticket as resolved.

> Python MongoDB IO performance and correctness issues
> 
>
> Key: BEAM-7866
> URL: https://issues.apache.org/jira/browse/BEAM-7866
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Eugene Kirpichov
>Assignee: Yichi Zhang
>Priority: Major
> Fix For: 2.15.0
>
>  Time Spent: 12h
>  Remaining Estimate: 0h
>
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/mongodbio.py
>  splits the query result by computing number of results in constructor, and 
> then in each reader re-executing the whole query and getting an index 
> sub-range of those results.
> This is broken in several critical ways:
> - The order of query results returned by find() is not necessarily 
> deterministic, so the idea of index ranges on it is meaningless: each shard 
> may basically get random, possibly overlapping subsets of the total results
> - Even if you add order by `_id`, the database may be changing concurrently 
> to reading and splitting. E.g. if the database contained documents with ids 
> 10 20 30 40 50, and this was split into shards 0..2 and 3..5 (under the 
> assumption that these shards would contain respectively 10 20 30, and 40 50), 
> and then suppose shard 10 20 30 is read and then document 25 is inserted - 
> then the 3..5 shard will read 30 40 50, i.e. document 30 is duplicated and 
> document 25 is lost.
> - Every shard re-executes the query and skips the first start_offset items, 
> which in total is quadratic complexity
> - The query is first executed in the constructor in order to count results, 
> which 1) means the constructor can be super slow and 2) it won't work at all 
> if the database is unavailable at the time the pipeline is constructed (e.g. 
> if this is a template).
> Unfortunately, none of these issues are caught by SourceTestUtils: this class 
> has extensive coverage with it, and the tests pass. This is because the tests 
> return the same results in the same order. I don't know how to catch this 
> automatically, and I don't know how to catch the performance issue 
> automatically, but these would all be important follow-up items after the 
> actual fix.
> CC: [~chamikara] as reviewer.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (BEAM-7866) Python MongoDB IO performance and correctness issues

2019-08-14 Thread yifan zou (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yifan zou updated BEAM-7866:

Fix Version/s: (was: 2.16.0)
   2.15.0

> Python MongoDB IO performance and correctness issues
> 
>
> Key: BEAM-7866
> URL: https://issues.apache.org/jira/browse/BEAM-7866
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Eugene Kirpichov
>Assignee: Yichi Zhang
>Priority: Major
> Fix For: 2.15.0
>
>  Time Spent: 12h
>  Remaining Estimate: 0h
>
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/mongodbio.py
>  splits the query result by computing number of results in constructor, and 
> then in each reader re-executing the whole query and getting an index 
> sub-range of those results.
> This is broken in several critical ways:
> - The order of query results returned by find() is not necessarily 
> deterministic, so the idea of index ranges on it is meaningless: each shard 
> may basically get random, possibly overlapping subsets of the total results
> - Even if you add order by `_id`, the database may be changing concurrently 
> to reading and splitting. E.g. if the database contained documents with ids 
> 10 20 30 40 50, and this was split into shards 0..2 and 3..5 (under the 
> assumption that these shards would contain respectively 10 20 30, and 40 50), 
> and then suppose shard 10 20 30 is read and then document 25 is inserted - 
> then the 3..5 shard will read 30 40 50, i.e. document 30 is duplicated and 
> document 25 is lost.
> - Every shard re-executes the query and skips the first start_offset items, 
> which in total is quadratic complexity
> - The query is first executed in the constructor in order to count results, 
> which 1) means the constructor can be super slow and 2) it won't work at all 
> if the database is unavailable at the time the pipeline is constructed (e.g. 
> if this is a template).
> Unfortunately, none of these issues are caught by SourceTestUtils: this class 
> has extensive coverage with it, and the tests pass. This is because the tests 
> return the same results in the same order. I don't know how to catch this 
> automatically, and I don't know how to catch the performance issue 
> automatically, but these would all be important follow-up items after the 
> actual fix.
> CC: [~chamikara] as reviewer.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7866) Python MongoDB IO performance and correctness issues

2019-08-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7866?focusedWorklogId=295037&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-295037
 ]

ASF GitHub Bot logged work on BEAM-7866:


Author: ASF GitHub Bot
Created on: 14/Aug/19 21:33
Start Date: 14/Aug/19 21:33
Worklog Time Spent: 10m 
  Work Description: yifanzou commented on pull request #9342: 
[BEAM-7866][BEAM-5148] Cherry-picks mongodb fixes to 2.15.0 release branch
URL: https://github.com/apache/beam/pull/9342
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 295037)
Time Spent: 12h  (was: 11h 50m)

> Python MongoDB IO performance and correctness issues
> 
>
> Key: BEAM-7866
> URL: https://issues.apache.org/jira/browse/BEAM-7866
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Eugene Kirpichov
>Assignee: Yichi Zhang
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 12h
>  Remaining Estimate: 0h
>
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/mongodbio.py
>  splits the query result by computing number of results in constructor, and 
> then in each reader re-executing the whole query and getting an index 
> sub-range of those results.
> This is broken in several critical ways:
> - The order of query results returned by find() is not necessarily 
> deterministic, so the idea of index ranges on it is meaningless: each shard 
> may basically get random, possibly overlapping subsets of the total results
> - Even if you add order by `_id`, the database may be changing concurrently 
> to reading and splitting. E.g. if the database contained documents with ids 
> 10 20 30 40 50, and this was split into shards 0..2 and 3..5 (under the 
> assumption that these shards would contain respectively 10 20 30, and 40 50), 
> and then suppose shard 10 20 30 is read and then document 25 is inserted - 
> then the 3..5 shard will read 30 40 50, i.e. document 30 is duplicated and 
> document 25 is lost.
> - Every shard re-executes the query and skips the first start_offset items, 
> which in total is quadratic complexity
> - The query is first executed in the constructor in order to count results, 
> which 1) means the constructor can be super slow and 2) it won't work at all 
> if the database is unavailable at the time the pipeline is constructed (e.g. 
> if this is a template).
> Unfortunately, none of these issues are caught by SourceTestUtils: this class 
> has extensive coverage with it, and the tests pass. This is because the tests 
> return the same results in the same order. I don't know how to catch this 
> automatically, and I don't know how to catch the performance issue 
> automatically, but these would all be important follow-up items after the 
> actual fix.
> CC: [~chamikara] as reviewer.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7866) Python MongoDB IO performance and correctness issues

2019-08-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7866?focusedWorklogId=295036&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-295036
 ]

ASF GitHub Bot logged work on BEAM-7866:


Author: ASF GitHub Bot
Created on: 14/Aug/19 21:32
Start Date: 14/Aug/19 21:32
Worklog Time Spent: 10m 
  Work Description: yifanzou commented on issue #9342: 
[BEAM-7866][BEAM-5148] Cherry-picks mongodb fixes to 2.15.0 release branch
URL: https://github.com/apache/beam/pull/9342#issuecomment-521427155
 
 
   LGTM
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 295036)
Time Spent: 11h 50m  (was: 11h 40m)

> Python MongoDB IO performance and correctness issues
> 
>
> Key: BEAM-7866
> URL: https://issues.apache.org/jira/browse/BEAM-7866
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Eugene Kirpichov
>Assignee: Yichi Zhang
>Priority: Major
> Fix For: 2.16.0
>
>  Time Spent: 11h 50m
>  Remaining Estimate: 0h
>
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/mongodbio.py
>  splits the query result by computing number of results in constructor, and 
> then in each reader re-executing the whole query and getting an index 
> sub-range of those results.
> This is broken in several critical ways:
> - The order of query results returned by find() is not necessarily 
> deterministic, so the idea of index ranges on it is meaningless: each shard 
> may basically get random, possibly overlapping subsets of the total results
> - Even if you add order by `_id`, the database may be changing concurrently 
> to reading and splitting. E.g. if the database contained documents with ids 
> 10 20 30 40 50, and this was split into shards 0..2 and 3..5 (under the 
> assumption that these shards would contain respectively 10 20 30, and 40 50), 
> and then suppose shard 10 20 30 is read and then document 25 is inserted - 
> then the 3..5 shard will read 30 40 50, i.e. document 30 is duplicated and 
> document 25 is lost.
> - Every shard re-executes the query and skips the first start_offset items, 
> which adds up to quadratic complexity in total
> - The query is first executed in the constructor in order to count results, 
> which 1) means the constructor can be super slow and 2) it won't work at all 
> if the database is unavailable at the time the pipeline is constructed (e.g. 
> if this is a template).
> Unfortunately, none of these issues are caught by SourceTestUtils: the IO has 
> extensive coverage with it, and the tests pass. This is because the tests 
> return the same results in the same order. I don't know how to catch this 
> automatically, and I don't know how to catch the performance issue 
> automatically, but these would all be important follow-up items after the 
> actual fix.
> CC: [~chamikara] as reviewer.
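The duplicate-and-loss scenario above is easy to reproduce with a small simulation. This is a hypothetical sketch, not the actual mongodbio code; `read_shard` stands in for a reader that re-executes the query and slices it by index:

```python
def read_shard(collection, start, end):
    # Each reader re-runs the "query" (here: a sort) and takes an
    # inclusive index sub-range, mirroring the skip/limit splitting.
    return sorted(collection)[start:end + 1]

docs = [10, 20, 30, 40, 50]
shard_a = read_shard(docs, 0, 2)  # reads [10, 20, 30]
docs.append(25)                   # concurrent insert between the two reads
shard_b = read_shard(docs, 3, 5)  # reads [30, 40, 50]
print(shard_a, shard_b)
```

With the split computed over the original 5 documents, the concurrent insert of 25 makes shard 3..5 see 30 40 50: document 30 is read twice and document 25 is never read.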



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7866) Python MongoDB IO performance and correctness issues

2019-08-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7866?focusedWorklogId=295023&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-295023
 ]

ASF GitHub Bot logged work on BEAM-7866:


Author: ASF GitHub Bot
Created on: 14/Aug/19 21:09
Start Date: 14/Aug/19 21:09
Worklog Time Spent: 10m 
  Work Description: chamikaramj commented on issue #9342: 
[BEAM-7866][BEAM-5148] Cherry-picks mongodb fixes to 2.15.0 release branch
URL: https://github.com/apache/beam/pull/9342#issuecomment-521419857
 
 
   R: @yifanzou 
   
   CC: @y1chi @aaltay 
 



Issue Time Tracking
---

Worklog Id: (was: 295023)
Time Spent: 11h 40m  (was: 11.5h)



[jira] [Work logged] (BEAM-7866) Python MongoDB IO performance and correctness issues

2019-08-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7866?focusedWorklogId=295022&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-295022
 ]

ASF GitHub Bot logged work on BEAM-7866:


Author: ASF GitHub Bot
Created on: 14/Aug/19 21:09
Start Date: 14/Aug/19 21:09
Worklog Time Spent: 10m 
  Work Description: chamikaramj commented on pull request #9342: 
[BEAM-7866][BEAM-5148] Cherry-picks mongodb fixes to 2.15.0 release branch
URL: https://github.com/apache/beam/pull/9342
 
 
   **Please** add a meaningful description for your change here
   
   
   
   Thank you for your contribution! Follow this checklist to help us 
incorporate your contribution quickly and easily:
   
- [ ] [**Choose 
reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and 
mention them in a comment (`R: @username`).
- [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in 
ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA 
issue, if applicable. This will automatically link the pull request to the 
issue.
- [ ] If this contribution is large, please file an Apache [Individual 
Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
   
 

[jira] [Work logged] (BEAM-7969) Streaming Dataflow worker doesn't report FnAPI metrics.

2019-08-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7969?focusedWorklogId=295016&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-295016
 ]

ASF GitHub Bot logged work on BEAM-7969:


Author: ASF GitHub Bot
Created on: 14/Aug/19 21:02
Start Date: 14/Aug/19 21:02
Worklog Time Spent: 10m 
  Work Description: angoenka commented on pull request #9330: [BEAM-7969] 
Report FnAPI counters as deltas in streaming jobs.
URL: https://github.com/apache/beam/pull/9330#discussion_r314081530
 
 

 ##
 File path: 
runners/google-cloud-dataflow-java/worker/src/main/java/org/apache/beam/runners/dataflow/worker/StreamingDataflowWorker.java
 ##
 @@ -1867,6 +1873,28 @@ private void sendWorkerUpdatesToDataflowService(
           cumulativeCounters.extractUpdates(false, DataflowCounterUpdateExtractor.INSTANCE));
       counterUpdates.addAll(
           deltaCounters.extractModifiedDeltaUpdates(DataflowCounterUpdateExtractor.INSTANCE));
 +      if (hasExperiment(options, "beam_fn_api")) {
 +        while (!this.pendingMonitoringInfos.isEmpty()) {
 +          final CounterUpdate item = this.pendingMonitoringInfos.poll();
 +
 +          // This change will treat counter as delta.
 +          // This is required because we receive cumulative results from FnAPI harness,
 +          // while streaming job is expected to receive delta updates to counters on same WorkItem.
 +          if (item.getCumulative()) {
 +            item.setCumulative(false);
 +          } else {
 +            // In current world all counters coming from FnAPI are cumulative.
 +            // This is a safety check in case new counter type appears in FnAPI.
 +            throw new UnsupportedOperationException(
 
 Review comment:
   Shall we log the error and move on?
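For context, cumulative-to-delta conversion in general can be sketched as follows. This is a hypothetical Python sketch, not the worker code (which instead flips the CounterUpdate's cumulative flag): remember the last cumulative value per counter and report only the difference.

```python
# Hypothetical sketch: turn cumulative counter reports into deltas by
# remembering the last cumulative value seen for each counter name.
last_reported = {}

def to_delta(counter, cumulative):
    delta = cumulative - last_reported.get(counter, 0)
    last_reported[counter] = cumulative
    return delta

print(to_delta("elements", 100))  # first report: the delta is the full value
print(to_delta("elements", 250))  # later report: only the increase since last time
```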
 



Issue Time Tracking
---

Worklog Id: (was: 295016)
Time Spent: 1.5h  (was: 1h 20m)

> Streaming Dataflow worker doesn't report FnAPI metrics.
> ---
>
> Key: BEAM-7969
> URL: https://issues.apache.org/jira/browse/BEAM-7969
> Project: Beam
>  Issue Type: Bug
>  Components: java-fn-execution, runner-dataflow
>Reporter: Mikhail Gryzykhin
>Assignee: Mikhail Gryzykhin
>Priority: Major
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> EOM





[jira] [Work logged] (BEAM-7969) Streaming Dataflow worker doesn't report FnAPI metrics.

2019-08-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7969?focusedWorklogId=295015&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-295015
 ]

ASF GitHub Bot logged work on BEAM-7969:


Author: ASF GitHub Bot
Created on: 14/Aug/19 21:02
Start Date: 14/Aug/19 21:02
Worklog Time Spent: 10m 
  Work Description: angoenka commented on pull request #9330: [BEAM-7969] 
Report FnAPI counters as deltas in streaming jobs.
URL: https://github.com/apache/beam/pull/9330#discussion_r314081113
 
 

 ##
 File path: 
runners/google-cloud-dataflow-java/worker/src/main/java/org/apache/beam/runners/dataflow/worker/StreamingDataflowWorker.java
 ##
 @@ -1867,6 +1873,28 @@ private void sendWorkerUpdatesToDataflowService(
           cumulativeCounters.extractUpdates(false, DataflowCounterUpdateExtractor.INSTANCE));
       counterUpdates.addAll(
           deltaCounters.extractModifiedDeltaUpdates(DataflowCounterUpdateExtractor.INSTANCE));
 +      if (hasExperiment(options, "beam_fn_api")) {
 +        while (!this.pendingMonitoringInfos.isEmpty()) {
 +          final CounterUpdate item = this.pendingMonitoringInfos.poll();
 
 Review comment:
   Let's also check for item == null to avoid a possible race condition, as it's 
called at start, via a timer, and at stop.
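The suggested null check is the usual drain pattern: take and test in a single step instead of checking emptiness first, so a concurrent consumer cannot empty the queue between the check and the take. A hypothetical Python sketch of the same pattern (the worker code itself is Java):

```python
from queue import Empty, SimpleQueue

def drain(pending):
    # get_nowait() either returns an item or raises Empty, so there is
    # no window between "is it empty?" and "take one" for a race.
    items = []
    while True:
        try:
            items.append(pending.get_nowait())
        except Empty:
            return items

q = SimpleQueue()
q.put(1)
q.put(2)
print(drain(q))  # [1, 2]
```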
 



Issue Time Tracking
---

Worklog Id: (was: 295015)
Time Spent: 1h 20m  (was: 1h 10m)

> Streaming Dataflow worker doesn't report FnAPI metrics.
> ---
>
> Key: BEAM-7969
> URL: https://issues.apache.org/jira/browse/BEAM-7969
> Project: Beam
>  Issue Type: Bug
>  Components: java-fn-execution, runner-dataflow
>Reporter: Mikhail Gryzykhin
>Assignee: Mikhail Gryzykhin
>Priority: Major
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> EOM





[jira] [Updated] (BEAM-7980) External environment with containerized worker pool

2019-08-14 Thread Thomas Weise (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Weise updated BEAM-7980:
---
Description: 
Augment the Beam Python docker image and boot.go so that they can be used to 
launch BeamFnExternalWorkerPoolServicer.

[https://docs.google.com/document/d/1z3LNrRtr8kkiFHonZ5JJM_L4NWNBBNcqRc_yAf6G0VI/edit#heading=h.lnhm75dhvhi0]

 

> External environment with containerized worker pool
> ---
>
> Key: BEAM-7980
> URL: https://issues.apache.org/jira/browse/BEAM-7980
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-py-harness
>Reporter: Thomas Weise
>Assignee: Thomas Weise
>Priority: Major
>
> Augment the Beam Python docker image and boot.go so that they can be used to 
> launch BeamFnExternalWorkerPoolServicer.
> [https://docs.google.com/document/d/1z3LNrRtr8kkiFHonZ5JJM_L4NWNBBNcqRc_yAf6G0VI/edit#heading=h.lnhm75dhvhi0]
>  





[jira] [Created] (BEAM-7980) External environment with containerized worker pool

2019-08-14 Thread Thomas Weise (JIRA)
Thomas Weise created BEAM-7980:
--

 Summary: External environment with containerized worker pool
 Key: BEAM-7980
 URL: https://issues.apache.org/jira/browse/BEAM-7980
 Project: Beam
  Issue Type: Improvement
  Components: sdk-py-harness
Reporter: Thomas Weise
Assignee: Thomas Weise








[jira] [Resolved] (BEAM-7667) report GCS throttling time to Dataflow autoscaler

2019-08-14 Thread Heejong Lee (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Heejong Lee resolved BEAM-7667.
---
   Resolution: Fixed
Fix Version/s: 2.15.0

> report GCS throttling time to Dataflow autoscaler
> -
>
> Key: BEAM-7667
> URL: https://issues.apache.org/jira/browse/BEAM-7667
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-dataflow
>Reporter: Heejong Lee
>Assignee: Heejong Lee
>Priority: Major
> Fix For: 2.15.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> report GCS throttling time to Dataflow autoscaler.





[jira] [Work logged] (BEAM-720) Run WindowedWordCount Integration Test in Flink

2019-08-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-720?focusedWorklogId=295008&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-295008
 ]

ASF GitHub Bot logged work on BEAM-720:
---

Author: ASF GitHub Bot
Created on: 14/Aug/19 20:28
Start Date: 14/Aug/19 20:28
Worklog Time Spent: 10m 
  Work Description: coveralls commented on issue #2662: [BEAM-720] 
Re-enable WindowedWordCountIT on Flink runner in precommit
URL: https://github.com/apache/beam/pull/2662#issuecomment-298197876
 
 
   
   [![Coverage 
Status](https://coveralls.io/builds/25140914/badge)](https://coveralls.io/builds/25140914)
   
   Coverage increased (+22.1%) to 91.918% when pulling 
**e2ebd9e191baddbb0d8486c676f625a3ff8b63ba on 
kennknowles:Flink-WindowedWordCountIT** into 
**f5e3f5230af35da5a03ba9740f087b0f22df6dca on apache:master**.
   
 



Issue Time Tracking
---

Worklog Id: (was: 295008)
Time Spent: 10m
Remaining Estimate: 0h

> Run WindowedWordCount Integration Test in Flink
> ---
>
> Key: BEAM-720
> URL: https://issues.apache.org/jira/browse/BEAM-720
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-flink, testing
>Reporter: Mark Liu
>Assignee: Kenneth Knowles
>Priority: Major
> Fix For: 2.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In order to have streaming pipeline test coverage in pre-commit, it's 
> important for TestFlinkRunner to be able to run WindowedWordCountIT 
> successfully. 
> Relevant works in TestDataflowRunner:
> https://github.com/apache/incubator-beam/pull/1045





[jira] [Commented] (BEAM-5286) [beam_PostCommit_Java_GradleBuild][org.apache.beam.examples.subprocess.ExampleEchoPipelineTest.testExampleEchoPipeline][Flake] .sh script: text file busy.

2019-08-14 Thread Andrew Pilloud (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-5286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907595#comment-16907595
 ] 

Andrew Pilloud commented on BEAM-5286:
--

This is marked as both "currently-failing" and "flake". It isn't getting much 
activity, so I'm guessing it is actually just a "flake".

> [beam_PostCommit_Java_GradleBuild][org.apache.beam.examples.subprocess.ExampleEchoPipelineTest.testExampleEchoPipeline][Flake]
>  .sh script: text file busy.
> --
>
> Key: BEAM-5286
> URL: https://issues.apache.org/jira/browse/BEAM-5286
> Project: Beam
>  Issue Type: Bug
>  Components: test-failures
>Reporter: Boyuan Zhang
>Assignee: Alan Myrvold
>Priority: Major
>  Labels: flake
> Fix For: Not applicable
>
>
> Sample failure: 
> [https://builds.apache.org/job/beam_PostCommit_Java_GradleBuild/1375/testReport/junit/org.apache.beam.examples.subprocess/ExampleEchoPipelineTest/testExampleEchoPipeline/]
> Sample relevant log:
> org.apache.beam.sdk.Pipeline$PipelineExecutionException: java.lang.Exception: 
> java.io.IOException: Cannot run program 
> "/tmp/test-Echoo1519764280436328522/test-EchoAgain3143210610074994370.sh": 
> error=26, Text file busy
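The error=26 (ETXTBSY, "Text file busy") here is the classic symptom of executing a script while a writable file descriptor to it is still open somewhere in the process. A hypothetical sketch of the safe ordering (unrelated to the actual test code): write, close, chmod, then exec. POSIX-only.

```python
import os
import stat
import subprocess
import tempfile

path = os.path.join(tempfile.mkdtemp(), "echo.sh")
with open(path, "w") as f:
    f.write("#!/bin/sh\necho ok\n")  # descriptor is closed when the block exits
os.chmod(path, stat.S_IRWXU)         # make it executable only after closing
print(subprocess.run([path], capture_output=True, text=True).stdout.strip())
```

Executing the script before the `with` block closes the file (e.g. from another thread that inherited the open descriptor) is what produces error 26.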





[jira] [Updated] (BEAM-5286) [beam_PostCommit_Java_GradleBuild][org.apache.beam.examples.subprocess.ExampleEchoPipelineTest.testExampleEchoPipeline][Flake] .sh script: text file busy.

2019-08-14 Thread Andrew Pilloud (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-5286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Pilloud updated BEAM-5286:
-
Labels: flake  (was: currently-failing flake)

> [beam_PostCommit_Java_GradleBuild][org.apache.beam.examples.subprocess.ExampleEchoPipelineTest.testExampleEchoPipeline][Flake]
>  .sh script: text file busy.
> --
>
> Key: BEAM-5286
> URL: https://issues.apache.org/jira/browse/BEAM-5286
> Project: Beam
>  Issue Type: Bug
>  Components: test-failures
>Reporter: Boyuan Zhang
>Assignee: Alan Myrvold
>Priority: Major
>  Labels: flake
> Fix For: Not applicable
>
>





[jira] [Work logged] (BEAM-7670) Flink portable worker gets stuck if one of the task does not get any data

2019-08-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7670?focusedWorklogId=294986&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-294986
 ]

ASF GitHub Bot logged work on BEAM-7670:


Author: ASF GitHub Bot
Created on: 14/Aug/19 19:55
Start Date: 14/Aug/19 19:55
Worklog Time Spent: 10m 
  Work Description: angoenka commented on issue #8992: [BEAM-7670] portable 
py: prevent race opening worker subprocess
URL: https://github.com/apache/beam/pull/8992#issuecomment-521394319
 
 
   Run Python PreCommit
 



Issue Time Tracking
---

Worklog Id: (was: 294986)
Time Spent: 2h 50m  (was: 2h 40m)

> Flink portable worker gets stuck if one of the task does not get any data
> -
>
> Key: BEAM-7670
> URL: https://issues.apache.org/jira/browse/BEAM-7670
> Project: Beam
>  Issue Type: Bug
>  Components: runner-flink
>Reporter: Ankur Goenka
>Priority: Major
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> When using parallelism > 1 with the Flink portable runner, the job gets stuck 
> if the data is partitioned in such a way that one of the tasks does not get 
> any data.





[jira] [Updated] (BEAM-6610) [beam_PostCommit_Python_Verify, beam_PostCommit_Java]ResourceExhausted: 429 Your project has exceeded a limit: (type="topics-per-project", current=10000, maximum=10000)

2019-08-14 Thread Andrew Pilloud (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-6610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Pilloud updated BEAM-6610:
-
Labels: mitigated  (was: currently-failing mitigated)

> [beam_PostCommit_Python_Verify, beam_PostCommit_Java]ResourceExhausted: 429 
> Your project has exceeded a limit: (type="topics-per-project", current=10000, 
> maximum=10000)
> 
>
> Key: BEAM-6610
> URL: https://issues.apache.org/jira/browse/BEAM-6610
> Project: Beam
>  Issue Type: New Feature
>  Components: test-failures
>Reporter: Mikhail Gryzykhin
>Assignee: Kenneth Knowles
>Priority: Major
>  Labels: mitigated
>
> Job:
> [https://builds.apache.org/job/beam_PostCommit_Python_Verify/7313/console]
> [https://builds.apache.org/job/beam_PostCommit_Java/2542/testReport/org.apache.beam.sdk.io.gcp.pubsub/PubsubReadIT/testReadPublicData/]
> Gradle target:
> *:beam-sdks-python:postCommitIT*
> Error:
> Traceback (most recent call last):
>   File "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/src/sdks/python/apache_beam/io/gcp/pubsub_integration_test.py", line 108, in setUp
>     self.pub_client.topic_path(self.project, INPUT_TOPIC + self.uuid))
>   File "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/src/build/gradleenv/1327086738/local/lib/python2.7/site-packages/google/cloud/pubsub_v1/_gapic.py", line 44, in <lambda>
>     fx = lambda self, *a, **kw: wrapped_fx(self.api, *a, **kw)  # noqa
>   File "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/src/build/gradleenv/1327086738/local/lib/python2.7/site-packages/google/cloud/pubsub_v1/gapic/publisher_client.py", line 271, in create_topic
>     request, retry=retry, timeout=timeout, metadata=metadata)
>   File "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/src/build/gradleenv/1327086738/local/lib/python2.7/site-packages/google/api_core/gapic_v1/method.py", line 143, in __call__
>     return wrapped_func(*args, **kwargs)
>   File "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/src/build/gradleenv/1327086738/local/lib/python2.7/site-packages/google/api_core/retry.py", line 270, in retry_wrapped_func
>     on_error=on_error,
>   File "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/src/build/gradleenv/1327086738/local/lib/python2.7/site-packages/google/api_core/retry.py", line 179, in retry_target
>     return target()
>   File "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/src/build/gradleenv/1327086738/local/lib/python2.7/site-packages/google/api_core/timeout.py", line 214, in func_with_timeout
>     return func(*args, **kwargs)
>   File "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/src/build/gradleenv/1327086738/local/lib/python2.7/site-packages/google/api_core/grpc_helpers.py", line 59, in error_remapped_callable
>     six.raise_from(exceptions.from_grpc_error(exc), exc)
>   File "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/src/build/gradleenv/1327086738/local/lib/python2.7/site-packages/six.py", line 737, in raise_from
>     raise value
> ResourceExhausted: 429 Your project has exceeded a limit: (type="topics-per-project", current=10000, maximum=10000).





[jira] [Commented] (BEAM-6610) [beam_PostCommit_Python_Verify, beam_PostCommit_Java]ResourceExhausted: 429 Your project has exceeded a limit: (type="topics-per-project", current=10000, maximum=10000)

2019-08-14 Thread Andrew Pilloud (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-6610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907586#comment-16907586
 ] 

Andrew Pilloud commented on BEAM-6610:
--

It is unlikely that this is both "currently-failing" and "mitigated". I'm going 
to remove "currently-failing".

> [beam_PostCommit_Python_Verify, beam_PostCommit_Java]ResourceExhausted: 429 
> Your project has exceeded a limit: (type="topics-per-project", current=10000, 
> maximum=10000)
> 
>
> Key: BEAM-6610
> URL: https://issues.apache.org/jira/browse/BEAM-6610
> Project: Beam
>  Issue Type: New Feature
>  Components: test-failures
>Reporter: Mikhail Gryzykhin
>Assignee: Kenneth Knowles
>Priority: Major
>  Labels: currently-failing, mitigated
>





[jira] [Commented] (BEAM-7225) [beam_PostCommit_Py_ValCont ] [test_metrics_fnapi_it] Unable to match metrics for matcher namespace

2019-08-14 Thread Andrew Pilloud (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-7225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907585#comment-16907585
 ] 

Andrew Pilloud commented on BEAM-7225:
--

This is marked as 'currently-failing', but the test appears to be passing. I'm 
removing the label; please add it back if that is incorrect.

[https://builds.apache.org/job/beam_PostCommit_Py_ValCont/]

> [beam_PostCommit_Py_ValCont ] [test_metrics_fnapi_it] Unable to match metrics 
> for matcher  namespace
> 
>
> Key: BEAM-7225
> URL: https://issues.apache.org/jira/browse/BEAM-7225
> Project: Beam
>  Issue Type: Bug
>  Components: test-failures
>Reporter: Ahmet Altay
>Assignee: Alex Amato
>Priority: Major
>  Labels: currently-failing
>
> https://builds.apache.org/job/beam_PostCommit_Py_ValCont/3115/console
> 13:19:41 test_metrics_fnapi_it 
> (apache_beam.runners.dataflow.dataflow_exercise_metrics_pipeline_test.ExerciseMetricsPipelineTest)
>  ... FAIL
> 13:19:41 
> 13:19:41 
> ==
> 13:19:41 FAIL: test_metrics_fnapi_it 
> (apache_beam.runners.dataflow.dataflow_exercise_metrics_pipeline_test.ExerciseMetricsPipelineTest)
> 13:19:41 
> --
> 13:19:41 Traceback (most recent call last):
> 13:19:41   File 
> "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_ValCont/src/sdks/python/apache_beam/runners/dataflow/dataflow_exercise_metrics_pipeline_test.py",
>  line 70, in test_metrics_fnapi_it
> 13:19:41 self.assertFalse(errors, str(errors))
> 13:19:41 AssertionError: Unable to match metrics for matcher  namespace: 
> 'apache_beam.runners.dataflow.dataflow_exercise_metrics_pipeline.UserMetricsDoFn'
>  name: 'total_values' step: 'metrics' attempted: <100> committed: <100>
> 13:19:41 Actual MetricResults:





[jira] [Updated] (BEAM-7225) [beam_PostCommit_Py_ValCont ] [test_metrics_fnapi_it] Unable to match metrics for matcher namespace

2019-08-14 Thread Andrew Pilloud (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Pilloud updated BEAM-7225:
-
Labels:   (was: currently-failing)

> [beam_PostCommit_Py_ValCont ] [test_metrics_fnapi_it] Unable to match metrics 
> for matcher  namespace
> 
>
> Key: BEAM-7225
> URL: https://issues.apache.org/jira/browse/BEAM-7225
> Project: Beam
>  Issue Type: Bug
>  Components: test-failures
>Reporter: Ahmet Altay
>Assignee: Alex Amato
>Priority: Major
>
> https://builds.apache.org/job/beam_PostCommit_Py_ValCont/3115/console
> 13:19:41 test_metrics_fnapi_it 
> (apache_beam.runners.dataflow.dataflow_exercise_metrics_pipeline_test.ExerciseMetricsPipelineTest)
>  ... FAIL
> 13:19:41 
> 13:19:41 
> ==
> 13:19:41 FAIL: test_metrics_fnapi_it 
> (apache_beam.runners.dataflow.dataflow_exercise_metrics_pipeline_test.ExerciseMetricsPipelineTest)
> 13:19:41 
> --
> 13:19:41 Traceback (most recent call last):
> 13:19:41   File 
> "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_ValCont/src/sdks/python/apache_beam/runners/dataflow/dataflow_exercise_metrics_pipeline_test.py",
>  line 70, in test_metrics_fnapi_it
> 13:19:41 self.assertFalse(errors, str(errors))
> 13:19:41 AssertionError: Unable to match metrics for matcher  namespace: 
> 'apache_beam.runners.dataflow.dataflow_exercise_metrics_pipeline.UserMetricsDoFn'
>  name: 'total_values' step: 'metrics' attempted: <100> committed: <100>
> 13:19:41 Actual MetricResults:



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (BEAM-7619) [beam_PostCommit_Py_ValCont] [test_metrics_fnapi_it] KeyError

2019-08-14 Thread Andrew Pilloud (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-7619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907578#comment-16907578
 ] 

Andrew Pilloud commented on BEAM-7619:
--

Hello, this is marked as "currently-failing". Is this still causing tests to 
fail?

> [beam_PostCommit_Py_ValCont] [test_metrics_fnapi_it] KeyError
> -
>
> Key: BEAM-7619
> URL: https://issues.apache.org/jira/browse/BEAM-7619
> Project: Beam
>  Issue Type: Bug
>  Components: test-failures
>Reporter: Mikhail Gryzykhin
>Assignee: Valentyn Tymofieiev
>Priority: Major
>  Labels: currently-failing
>
> _Use this form to file an issue for test failure:_
>  * [Jenkins 
> Job|https://builds.apache.org/job/beam_PostCommit_Py_ValCont/3625/consoleFull]
>  
> Initial investigation:
> 12:08:13 ERROR: test_wordcount_fnapi_it 
> (apache_beam.examples.wordcount_it_test.WordCountIT)
> 12:08:13 
> --
> 12:08:13 Traceback (most recent call last):
> 12:08:13 File 
> "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_ValCont/src/sdks/python/apache_beam/examples/wordcount_it_test.py",
>  line 52, in test_wordcount_fnapi_it
> 12:08:13 self._run_wordcount_it(wordcount.run, experiment='beam_fn_api')
> 12:08:13 File 
> "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_ValCont/src/sdks/python/apache_beam/examples/wordcount_it_test.py",
>  line 84, in _run_wordcount_it
> 12:08:13 run_wordcount(test_pipeline.get_full_options_as_args(**extra_opts))
> 12:08:13 File 
> "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_ValCont/src/sdks/python/apache_beam/examples/wordcount.py",
>  line 114, in run
> 12:08:13 result = p.run()
> 12:08:13 File 
> "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_ValCont/src/sdks/python/apache_beam/pipeline.py",
>  line 406, in run
> 12:08:13 self._options).run(False)
> 12:08:13 File 
> "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_ValCont/src/sdks/python/apache_beam/pipeline.py",
>  line 419, in run
> 12:08:13 return self.runner.run_pipeline(self, self._options)
> 12:08:13 File 
> "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_ValCont/src/sdks/python/apache_beam/runners/dataflow/test_dataflow_runner.py",
>  line 64, in run_pipeline
> 12:08:13 self.result.wait_until_finish(duration=wait_duration)
> 12:08:13 File 
> "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Py_ValCont/src/sdks/python/apache_beam/runners/dataflow/dataflow_runner.py",
>  line 1338, in wait_until_finish
> 12:08:13 (self.state, getattr(self._runner, 'last_error_msg', None)), self)
> 12:08:13 DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, 
> Error:
> 12:08:13 java.util.concurrent.ExecutionException: java.lang.RuntimeException: 
> Error received from SDK harness for instruction -129: Traceback (most recent 
> call last):
> 12:08:13 File 
> "/usr/local/lib/python2.7/site-packages/apache_beam/runners/worker/sdk_worker.py",
>  line 157, in _execute
> 12:08:13 response = task()
> 12:08:13 File 
> "/usr/local/lib/python2.7/site-packages/apache_beam/runners/worker/sdk_worker.py",
>  line 190, in <lambda>
> 12:08:13 self._execute(lambda: worker.do_instruction(work), work)
> 12:08:13 File 
> "/usr/local/lib/python2.7/site-packages/apache_beam/runners/worker/sdk_worker.py",
>  line 342, in do_instruction
> 12:08:13 request.instruction_id)
> 12:08:13 File 
> "/usr/local/lib/python2.7/site-packages/apache_beam/runners/worker/sdk_worker.py",
>  line 368, in process_bundle
> 12:08:13 bundle_processor.process_bundle(instruction_id))
> 12:08:13 File 
> "/usr/local/lib/python2.7/site-packages/apache_beam/runners/worker/bundle_processor.py",
>  line 593, in process_bundle
> 12:08:13 data.ptransform_id].process_encoded(data.data)
> 12:08:13 KeyError: u'\n\x04-107\x12\x04-105'
> 12:08:13 
> 12:08:13 at 
> java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
> 12:08:13 at 
> java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
> 12:08:13 at org.apache.beam.sdk.util.MoreFutures.get(MoreFutures.java:57)
> 12:08:13 at 
> org.apache.beam.runners.dataflow.worker.fn.control.RegisterAndProcessBundleOperation.finish(RegisterAndProcessBundleOperation.java:285)
> 12:08:13 at 
> org.apache.beam.runners.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:85)
> 12:08:13 at 
> org.apache.beam.runners.dataflow.worker.fn.control.BeamFnMapTaskExecutor.execute(BeamFnMapTaskExecutor.java:125)
> 12:08:13 at 
> org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.executeWork(BatchDataflowWorker.java:412)
> 12:08:13 at 
> org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.doWork(BatchDataflowWorker.java:381)
> 12:08:13 at 
> 

[jira] [Commented] (BEAM-7410) Explore possibilities to lower in-use IP address quota footprint.

2019-08-14 Thread Andrew Pilloud (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-7410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907576#comment-16907576
 ] 

Andrew Pilloud commented on BEAM-7410:
--

This is marked as "currently-failing". Is it actually causing tests to fail 
right now?

> Explore possibilities to lower in-use IP address quota footprint.
> -
>
> Key: BEAM-7410
> URL: https://issues.apache.org/jira/browse/BEAM-7410
> Project: Beam
>  Issue Type: Bug
>  Components: test-failures
>Reporter: Mikhail Gryzykhin
>Assignee: Alan Myrvold
>Priority: Major
>  Labels: currently-failing
>
>  
> {noformat}
> java.lang.RuntimeException: Workflow failed. Causes: Project 
> apache-beam-testing has insufficient quota(s) to execute this workflow with 1 
> instances in region us-central1. Quota summary (required/available): 1/11743 
> instances, 1/170 CPUs, 250/278399 disk GB, 0/4046 SSD disk GB, 1/188 instance 
> groups, 1/188 managed instance groups, 1/126 instance templates, 1/0 in-use 
> IP addresses.{noformat}
>  
> Sample jobs:
> [https://builds.apache.org/job/beam_PostCommit_Java11_ValidatesRunner_Dataflow/440/testReport/junit/org.apache.beam.sdk.testing/PAssertTest/testGlobalWindowContainsInAnyOrder/]
> [https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow_PR/67/testReport/junit/org.apache.beam.sdk.transforms/ViewTest/testEmptyIterableSideInput/]
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Resolved] (BEAM-7843) apache-beam-jenkins-14 unhealthy ?

2019-08-14 Thread Andrew Pilloud (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Pilloud resolved BEAM-7843.
--
   Resolution: Fixed
Fix Version/s: Not applicable

> apache-beam-jenkins-14 unhealthy ?
> --
>
> Key: BEAM-7843
> URL: https://issues.apache.org/jira/browse/BEAM-7843
> Project: Beam
>  Issue Type: Bug
>  Components: test-failures, testing
>Reporter: Ahmet Altay
>Assignee: yifan zou
>Priority: Major
>  Labels: currently-failing
> Fix For: Not applicable
>
>
> https://builds.apache.org/job/beam_PreCommit_Python_PVR_Flink_Commit/4042/consoleFull
> 09:33:34  at 
> org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:2042)
> 09:33:34  at 
> org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:2010)
> 09:33:34  at 
> org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:2006)
> 09:33:34  at 
> org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommand(CliGitAPIImpl.java:1638)
> 09:33:34  at 
> org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommand(CliGitAPIImpl.java:1650)
> 09:33:34  at 
> org.jenkinsci.plugins.gitclient.CliGitAPIImpl.clean(CliGitAPIImpl.java:828)
> 09:33:34  at hudson.plugins.git.GitAPI.clean(GitAPI.java:311)
> 09:33:34  at sun.reflect.GeneratedMethodAccessor62.invoke(Unknown Source)
> 09:33:34  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 09:33:34  at java.lang.reflect.Method.invoke(Method.java:498)
> 09:33:34  at 
> hudson.remoting.RemoteInvocationHandler$RPCRequest.perform(RemoteInvocationHandler.java:929)
> 09:33:34  at 
> hudson.remoting.RemoteInvocationHandler$RPCRequest.call(RemoteInvocationHandler.java:903)
> 09:33:34  at 
> hudson.remoting.RemoteInvocationHandler$RPCRequest.call(RemoteInvocationHandler.java:855)
> 09:33:34  at hudson.remoting.UserRequest.perform(UserRequest.java:212)
> 09:33:34  at hudson.remoting.UserRequest.perform(UserRequest.java:54)
> 09:33:34  at hudson.remoting.Request$2.run(Request.java:369)
> 09:33:34  at 
> hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
> 09:33:34  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> 09:33:34  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> 09:33:34  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> 09:33:34  at hudson.remoting.Engine$1.lambda$newThread$0(Engine.java:93)
> 09:33:34  at java.lang.Thread.run(Thread.java:748)
> 09:33:36 Setting status of 7cc4604ddecd3cc8819ccfc0fb307f4723494d4a to 
> FAILURE with url 
> https://builds.apache.org/job/beam_PreCommit_Python_PVR_Flink_Commit/4042/ 
> and message: 'FAILURE
> 09:33:36  '



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (BEAM-7843) apache-beam-jenkins-14 unhealthy ?

2019-08-14 Thread Andrew Pilloud (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-7843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907574#comment-16907574
 ] 

Andrew Pilloud commented on BEAM-7843:
--

This looks to have been fixed.

> apache-beam-jenkins-14 unhealthy ?
> --
>
> Key: BEAM-7843
> URL: https://issues.apache.org/jira/browse/BEAM-7843
> Project: Beam
>  Issue Type: Bug
>  Components: test-failures, testing
>Reporter: Ahmet Altay
>Assignee: yifan zou
>Priority: Major
>  Labels: currently-failing
>
> https://builds.apache.org/job/beam_PreCommit_Python_PVR_Flink_Commit/4042/consoleFull
> 09:33:34  at 
> org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:2042)
> 09:33:34  at 
> org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:2010)
> 09:33:34  at 
> org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:2006)
> 09:33:34  at 
> org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommand(CliGitAPIImpl.java:1638)
> 09:33:34  at 
> org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommand(CliGitAPIImpl.java:1650)
> 09:33:34  at 
> org.jenkinsci.plugins.gitclient.CliGitAPIImpl.clean(CliGitAPIImpl.java:828)
> 09:33:34  at hudson.plugins.git.GitAPI.clean(GitAPI.java:311)
> 09:33:34  at sun.reflect.GeneratedMethodAccessor62.invoke(Unknown Source)
> 09:33:34  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 09:33:34  at java.lang.reflect.Method.invoke(Method.java:498)
> 09:33:34  at 
> hudson.remoting.RemoteInvocationHandler$RPCRequest.perform(RemoteInvocationHandler.java:929)
> 09:33:34  at 
> hudson.remoting.RemoteInvocationHandler$RPCRequest.call(RemoteInvocationHandler.java:903)
> 09:33:34  at 
> hudson.remoting.RemoteInvocationHandler$RPCRequest.call(RemoteInvocationHandler.java:855)
> 09:33:34  at hudson.remoting.UserRequest.perform(UserRequest.java:212)
> 09:33:34  at hudson.remoting.UserRequest.perform(UserRequest.java:54)
> 09:33:34  at hudson.remoting.Request$2.run(Request.java:369)
> 09:33:34  at 
> hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
> 09:33:34  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> 09:33:34  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> 09:33:34  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> 09:33:34  at hudson.remoting.Engine$1.lambda$newThread$0(Engine.java:93)
> 09:33:34  at java.lang.Thread.run(Thread.java:748)
> 09:33:36 Setting status of 7cc4604ddecd3cc8819ccfc0fb307f4723494d4a to 
> FAILURE with url 
> https://builds.apache.org/job/beam_PreCommit_Python_PVR_Flink_Commit/4042/ 
> and message: 'FAILURE
> 09:33:36  '



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (BEAM-7806) org.apache.beam.sdk.extensions.sql.meta.provider.pubsub.PubsubJsonIT.testSelectsPayloadContent failed

2019-08-14 Thread Andrew Pilloud (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-7806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907572#comment-16907572
 ] 

Andrew Pilloud commented on BEAM-7806:
--

Based on the PR, this is no longer 'currently-failing'. Should it be closed as 
fixed as well?

> org.apache.beam.sdk.extensions.sql.meta.provider.pubsub.PubsubJsonIT.testSelectsPayloadContent
>  failed
> -
>
> Key: BEAM-7806
> URL: https://issues.apache.org/jira/browse/BEAM-7806
> Project: Beam
>  Issue Type: Bug
>  Components: test-failures
>Reporter: Boyuan Zhang
>Assignee: Kenneth Knowles
>Priority: Critical
>  Labels: currently-failing
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> First failure: https://builds.apache.org/job/beam_PostCommit_SQL/2135/



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (BEAM-7806) org.apache.beam.sdk.extensions.sql.meta.provider.pubsub.PubsubJsonIT.testSelectsPayloadContent failed

2019-08-14 Thread Andrew Pilloud (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Pilloud updated BEAM-7806:
-
Labels:   (was: currently-failing)

> org.apache.beam.sdk.extensions.sql.meta.provider.pubsub.PubsubJsonIT.testSelectsPayloadContent
>  failed
> -
>
> Key: BEAM-7806
> URL: https://issues.apache.org/jira/browse/BEAM-7806
> Project: Beam
>  Issue Type: Bug
>  Components: test-failures
>Reporter: Boyuan Zhang
>Assignee: Kenneth Knowles
>Priority: Critical
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> First failure: https://builds.apache.org/job/beam_PostCommit_SQL/2135/



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7386) Add Utility BiTemporalStreamJoin

2019-08-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7386?focusedWorklogId=294974&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-294974
 ]

ASF GitHub Bot logged work on BEAM-7386:


Author: ASF GitHub Bot
Created on: 14/Aug/19 19:34
Start Date: 14/Aug/19 19:34
Worklog Time Spent: 10m 
  Work Description: je-ik commented on pull request #9032: [BEAM-7386] 
Bi-Temporal Join
URL: https://github.com/apache/beam/pull/9032#discussion_r313952183
 
 

 ##
 File path: 
sdks/java/extensions/timeseries/src/main/java/org/apache/beam/sdk/extensions/timeseries/joins/BiTemporalStreams.java
 ##
 @@ -0,0 +1,464 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.beam.sdk.extensions.timeseries.joins;
+
+import java.util.*;
+import org.apache.beam.sdk.annotations.Experimental;
+import org.apache.beam.sdk.coders.Coder;
+import org.apache.beam.sdk.coders.KvCoder;
+import org.apache.beam.sdk.coders.VarLongCoder;
+import org.apache.beam.sdk.state.*;
+import org.apache.beam.sdk.state.Timer;
+import org.apache.beam.sdk.transforms.DoFn;
+import org.apache.beam.sdk.transforms.Flatten;
+import org.apache.beam.sdk.transforms.PTransform;
+import org.apache.beam.sdk.transforms.ParDo;
+import org.apache.beam.sdk.transforms.join.KeyedPCollectionTuple;
+import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
+import org.apache.beam.sdk.transforms.windowing.FixedWindows;
+import org.apache.beam.sdk.transforms.windowing.PaneInfo;
+import org.apache.beam.sdk.transforms.windowing.Window;
+import org.apache.beam.sdk.util.WindowedValue;
+import org.apache.beam.sdk.values.*;
+import org.joda.time.Duration;
+import org.joda.time.Instant;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * This sample will take two streams, a left stream and a right stream. It 
+ * will match each left-stream value to the nearest right-stream value based 
+ * on their timestamps, where the nearest right-stream value is the one whose 
+ * timestamp is <= the left-stream timestamp.
+ *
+ * If two values in the right stream have the same timestamp then the results 
+ * are non-deterministic.
+ */
+@Experimental
+public class BiTemporalStreams {
+
+  public static  BiTemporalJoin join(
+  TupleTag leftTag, TupleTag rightTag, Duration window) {
+return new BiTemporalJoin<>(leftTag, rightTag, window);
+  }
+
+  @Experimental
+  public static class BiTemporalJoin
+  extends PTransform, 
PCollection>> {
+
+private static final Logger LOG = 
LoggerFactory.getLogger(BiTemporalStreams.class);
+
+// Sets the limit at which point the processed left stream values are 
garbage collected
+static int LEFT_STREAM_GC_LIMIT = 1000;
+
+Coder leftCoder;
+Coder rightCoder;
+
+TupleTag leftTag;
+TupleTag rightTag;
+
+Duration window;
+
+private BiTemporalJoin(TupleTag leftTag, TupleTag rightTag, 
Duration window) {
+  this.leftTag = leftTag;
+  this.rightTag = rightTag;
+  this.window = window;
+}
+
+public static  BiTemporalJoin create(
+TupleTag leftTag, TupleTag rightTag, Duration window) {
+  return new BiTemporalJoin(leftTag, rightTag, window);
+}
+
+public BiTemporalJoin setGCLimit(int gcLimit) {
+  LEFT_STREAM_GC_LIMIT = gcLimit;
+  return this;
+}
+
+@Override
+public PCollection> 
expand(KeyedPCollectionTuple input) {
+
+  List> collections =
+  input.getKeyedCollections();
+
+  PCollection> leftCollection = null;
+  PCollection> rightCollection = null;
+
+  for (KeyedPCollectionTuple.TaggedKeyedPCollection t : collections) 
{
+
+if (t.getTupleTag().equals(leftTag)) {
+  leftCollection =
+  ((KeyedPCollectionTuple.TaggedKeyedPCollection) 
t).getCollection();
+}
+
+if (t.getTupleTag().equals(rightTag)) {
+  rightCollection =
+  ((KeyedPCollectionTuple.TaggedKeyedPCollection) 
t).getCollection();
+}
+  }
+
+  leftCoder = ((KvCoder) leftCollection.getCoder()).getValueCoder();
+  rightCoder = ((KvCoder) 
rightCollection.getCoder()).getValueCoder();
+
+  BiTemporalJoinResultCoder 
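The BiTemporalStreams excerpt above (its generic type parameters were stripped in transit) matches each left-stream value to the nearest right-stream value by timestamp. A minimal Python sketch of that matching rule, assuming "nearest" means the latest right value whose timestamp is <= the left value's timestamp (the usual as-of-join reading of the javadoc); `match_nearest_right` is a hypothetical helper for illustration, not part of the Beam API:

```python
import bisect

def match_nearest_right(left_ts, right_stream):
    """Return the right-stream (ts, value) with the largest timestamp <= left_ts.

    right_stream must be sorted by timestamp; returns None when every right
    value is later than left_ts. If two right values share a timestamp the
    result is non-deterministic in the real transform; here the last one wins.
    """
    timestamps = [ts for ts, _ in right_stream]
    i = bisect.bisect_right(timestamps, left_ts)
    return right_stream[i - 1] if i > 0 else None

right = [(1, "r1"), (5, "r5"), (9, "r9")]
print(match_nearest_right(7, right))  # (5, 'r5')
print(match_nearest_right(1, right))  # (1, 'r1')
print(match_nearest_right(0, right))  # None -- no right value at or before 0
```

The real transform additionally keys both streams, windows them, and garbage-collects processed left values (the LEFT_STREAM_GC_LIMIT above); this sketch only shows the per-key matching rule.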

[jira] [Work logged] (BEAM-7520) DirectRunner timers are not strictly time ordered

2019-08-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7520?focusedWorklogId=294932&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-294932
 ]

ASF GitHub Bot logged work on BEAM-7520:


Author: ASF GitHub Bot
Created on: 14/Aug/19 18:50
Start Date: 14/Aug/19 18:50
Worklog Time Spent: 10m 
  Work Description: je-ik commented on pull request #9190: [BEAM-7520] Fix 
timer firing order in DirectRunner
URL: https://github.com/apache/beam/pull/9190#discussion_r314029766
 
 

 ##
 File path: 
runners/direct-java/src/main/java/org/apache/beam/runners/direct/WatermarkCallbackExecutor.java
 ##
 @@ -116,14 +119,30 @@ public void callOnWindowExpiration(
* Schedule all pending callbacks that must have produced output by the time 
of the provided
* watermark.
*/
-  public void fireForWatermark(AppliedPTransform step, Instant 
watermark) {
+  public void fireForWatermark(AppliedPTransform step, Instant 
watermark)
+  throws InterruptedException {
 PriorityQueue callbackQueue = callbacks.get(step);
 if (callbackQueue == null) {
   return;
 }
 synchronized (callbackQueue) {
+  List toFire = new ArrayList<>();
   while (!callbackQueue.isEmpty() && 
callbackQueue.peek().shouldFire(watermark)) {
-executor.execute(callbackQueue.poll().getCallback());
+toFire.add(callbackQueue.poll().getCallback());
+  }
+  if (!toFire.isEmpty()) {
+CountDownLatch latch = new CountDownLatch(toFire.size());
 
 Review comment:
   Hm, my previous comment might suggest that the change is not needed at all, 
and maybe that is true. :-) I added this code as part of my "I'm really 
desperate" period while solving the issue. Just to be sure that this doesn't 
break in the future, I'd like to keep it as is, if that is okay. 
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
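The CountDownLatch in the diff above makes fireForWatermark block until every callback scheduled for a watermark has actually run. A Python analogue of that pattern, with a minimal latch class standing in for java.util.concurrent.CountDownLatch (names here are illustrative, not the DirectRunner code):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class CountDownLatch:
    """Minimal analogue of java.util.concurrent.CountDownLatch."""
    def __init__(self, count):
        self.count = count
        self.cond = threading.Condition()

    def count_down(self):
        with self.cond:
            self.count -= 1
            if self.count <= 0:
                self.cond.notify_all()

    def await_(self):
        with self.cond:
            while self.count > 0:
                self.cond.wait()

def fire_and_wait(executor, callbacks):
    # Wrap each callback so it counts down when done (as in the diff), then
    # block until every callback for this watermark has completed.
    latch = CountDownLatch(len(callbacks))
    results = []
    def wrap(cb):
        def run():
            try:
                results.append(cb())
            finally:
                latch.count_down()  # count down even if the callback raised
        return run
    for cb in callbacks:
        executor.submit(wrap(cb))
    latch.await_()
    return results

with ThreadPoolExecutor(max_workers=4) as ex:
    out = fire_and_wait(ex, [lambda i=i: i * i for i in range(5)])
print(sorted(out))  # [0, 1, 4, 9, 16]
```

The design point being debated in the review: the latch turns an otherwise fire-and-forget executor submission into a synchronous barrier, so callbacks for one watermark cannot interleave with the next batch.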


Issue Time Tracking
---

Worklog Id: (was: 294932)
Time Spent: 4h 50m  (was: 4h 40m)

> DirectRunner timers are not strictly time ordered
> -
>
> Key: BEAM-7520
> URL: https://issues.apache.org/jira/browse/BEAM-7520
> Project: Beam
>  Issue Type: Bug
>  Components: runner-direct
>Affects Versions: 2.13.0
>Reporter: Jan Lukavský
>Assignee: Jan Lukavský
>Priority: Major
>  Time Spent: 4h 50m
>  Remaining Estimate: 0h
>
> Let's suppose we have the following situation:
>  - stateful ParDo with two timers - timerA and timerB
>  - timerA is set for window.maxTimestamp() + 1
>  - timerB is set anywhere between  timerB.timestamp
>  - input watermark moves to BoundedWindow.TIMESTAMP_MAX_VALUE
> Then the order of timers is as follows (correct):
>  - timerB
>  - timerA
> But, if timerB sets another timer (say for timerB.timestamp + 1), then the 
> order of timers will be:
>  - timerB (timerB.timestamp)
>  - timerA (BoundedWindow.TIMESTAMP_MAX_VALUE)
>  - timerB (timerB.timestamp + 1)
> Which is not ordered by timestamp. The reason for this is that when the input 
> watermark update is evaluated, the WatermarkManager.extractFiredTimers() will 
> produce both timerA and timerB. That would be correct, but when timerB sets 
> another timer, that breaks this.
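The reordering described above can be reproduced with a toy simulation (plain Python, not the DirectRunner code): because the watermark update extracts every eligible timer in one batch, a timer set while the batch is firing lands behind already-extracted later timers.

```python
import heapq

def fire_for_watermark(heap, watermark, on_fire):
    """Extract *all* timers eligible at this watermark in one batch, then fire.

    on_fire(ts) may set new timers (returned as a list of timestamps); a timer
    set while the batch is firing is only seen by the next extraction pass,
    which is what produces the out-of-timestamp-order delivery in the issue.
    """
    fired = []
    while heap and heap[0] <= watermark:
        batch = []
        while heap and heap[0] <= watermark:
            batch.append(heapq.heappop(heap))
        for ts in batch:
            fired.append(ts)
            for new_ts in on_fire(ts):
                heapq.heappush(heap, new_ts)
    return fired

MAX_TS = 100  # stands in for BoundedWindow.TIMESTAMP_MAX_VALUE
timers = [10, MAX_TS]  # timerB at 10, timerA at window.maxTimestamp() + 1
heapq.heapify(timers)

state = {"set_once": False}
def on_fire(ts):
    # When timerB fires it sets another timer at ts + 1, as in the description.
    if ts == 10 and not state["set_once"]:
        state["set_once"] = True
        return [11]
    return []

order = fire_for_watermark(timers, MAX_TS, on_fire)
print(order)  # [10, 100, 11] -- timerB's follow-up fires after timerA
```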



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (BEAM-7978) ArithmeticExceptions on getting backlog bytes

2019-08-14 Thread Chamikara Jayalath (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-7978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907539#comment-16907539
 ] 

Chamikara Jayalath commented on BEAM-7978:
--

I guess this can occur for other runners as well?

CCing some folks who recently updated Kinesis:

[~aromanenko] [~iemejia] [~rtshadow] @

 

> ArithmeticExceptions on getting backlog bytes 
> --
>
> Key: BEAM-7978
> URL: https://issues.apache.org/jira/browse/BEAM-7978
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-kinesis
>Affects Versions: 2.14.0
>Reporter: Mateusz
>Priority: Major
>
> Hello,
> Beam 2.14.0
>  (and to be more precise 
> [commit|https://github.com/apache/beam/commit/9fe0a0387ad1f894ef02acf32edbc9623a3b13ad#diff-b4964a457006b1555c7042c739b405ec])
>  introduced a change in watermark calculation in Kinesis IO causing below 
> error:
> {code:java}
> exception:  "java.lang.RuntimeException: Unknown kinesis failure, when trying 
> to reach kinesis
>   at 
> org.apache.beam.sdk.io.kinesis.SimplifiedKinesisClient.wrapExceptions(SimplifiedKinesisClient.java:227)
>   at 
> org.apache.beam.sdk.io.kinesis.SimplifiedKinesisClient.getBacklogBytes(SimplifiedKinesisClient.java:167)
>   at 
> org.apache.beam.sdk.io.kinesis.SimplifiedKinesisClient.getBacklogBytes(SimplifiedKinesisClient.java:155)
>   at 
> org.apache.beam.sdk.io.kinesis.KinesisReader.getTotalBacklogBytes(KinesisReader.java:158)
>   at 
> org.apache.beam.runners.dataflow.worker.StreamingModeExecutionContext.flushState(StreamingModeExecutionContext.java:433)
>   at 
> org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:1289)
>   at 
> org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.access$1000(StreamingDataflowWorker.java:149)
>   at 
> org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker$6.run(StreamingDataflowWorker.java:1024)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ArithmeticException: Value cannot fit in an int: 
> 153748963401
>   at org.joda.time.field.FieldUtils.safeToInt(FieldUtils.java:229)
>   at 
> org.joda.time.field.BaseDurationField.getDifference(BaseDurationField.java:141)
>   at 
> org.joda.time.base.BaseSingleFieldPeriod.between(BaseSingleFieldPeriod.java:72)
>   at org.joda.time.Minutes.minutesBetween(Minutes.java:101)
>   at 
> org.apache.beam.sdk.io.kinesis.SimplifiedKinesisClient.lambda$getBacklogBytes$3(SimplifiedKinesisClient.java:169)
>   at 
> org.apache.beam.sdk.io.kinesis.SimplifiedKinesisClient.wrapExceptions(SimplifiedKinesisClient.java:210)
>   ... 10 more
> {code}
> We spotted this issue on Dataflow runner. It's problematic as inability to 
> get backlog bytes seems to result in constant recreation of KinesisReader.
> The issue happens if the backlog bytes are retrieved before the watermark 
> value is updated from its initial default value. An easy way to reproduce it 
> is to create a pipeline with a Kinesis source for a stream where no records 
> are being put. While debugging it locally, you can observe that the watermark 
> is set to a value in the past (like: "-290308-12-21T19:59:05.225Z"). After 
> two minutes (the default watermark idle duration threshold is 2 minutes), the 
> watermark is set to the value of 
> [watermarkIdleThreshold|https://github.com/apache/beam/blob/9fe0a0387ad1f894ef02acf32edbc9623a3b13ad/sdks/java/io/kinesis/src/main/java/org/apache/beam/sdk/io/kinesis/WatermarkPolicyFactory.java#L110]),
>  so the next backlog bytes retrieval should be correct. However, as described 
> before, running the pipeline on Dataflow runner results in KinesisReader 
> being closed just after creation, so the watermark won't be fixed.
> The reason for the issue is the following: the introduced watermark policies are 
> relying on 
> [WatermarkParameters|https://github.com/apache/beam/blob/9fe0a0387ad1f894ef02acf32edbc9623a3b13ad/sdks/java/io/kinesis/src/main/java/org/apache/beam/sdk/io/kinesis/WatermarkParameters.java]
>  which initialises currentWatermark and eventTime to 
> [BoundedWindow.TIMESTAMP_MIN_VALUE|https://github.com/apache/beam/commit/9fe0a0387ad1f894ef02acf32edbc9623a3b13ad#diff-b4964a457006b1555c7042c739b405ecR52].
>  This results in the watermark being set to new Instant(-9223372036854775L) at 
> KinesisReader creation. Calculated [period between the watermark and the 
> current 
> timestamp|https://github.com/apache/beam/blob/9fe0a0387ad1f894ef02acf32edbc9623a3b13ad/sdks/java/io/kinesis/src/main/java/org/apache/beam/sdk/io/kinesis/SimplifiedKinesisClient.java#L169]
>  is 
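Numerically, the failure described above is an int-range check: the number of minutes between BoundedWindow.TIMESTAMP_MIN_VALUE and the current time does not fit in 32 bits, so joda-time's FieldUtils.safeToInt throws. A small Python sketch of the arithmetic; `safe_to_int` is a stand-in for the joda check, and the "now" timestamp is an arbitrary August 2019 value, not the one from the report:

```python
INT_MAX = 2**31 - 1

def safe_to_int(value):
    # Stand-in for org.joda.time.field.FieldUtils.safeToInt, which raises
    # when a long value does not fit in a 32-bit int.
    if not (-INT_MAX - 1 <= value <= INT_MAX):
        raise ArithmeticError("Value cannot fit in an int: %d" % value)
    return value

# Watermark initialised to new Instant(-9223372036854775L) epoch millis,
# i.e. BoundedWindow.TIMESTAMP_MIN_VALUE, as described in the report.
watermark_millis = -9223372036854775
now_millis = 1_565_800_000_000  # some time in August 2019, for illustration
minutes_between = (now_millis - watermark_millis) // 60_000

print(minutes_between > INT_MAX)  # True -- on the order of 1.5e11 minutes
try:
    safe_to_int(minutes_between)
except ArithmeticError as e:
    print(e)
```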

[jira] [Work logged] (BEAM-7520) DirectRunner timers are not strictly time ordered

2019-08-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7520?focusedWorklogId=294928&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-294928
 ]

ASF GitHub Bot logged work on BEAM-7520:


Author: ASF GitHub Bot
Created on: 14/Aug/19 18:46
Start Date: 14/Aug/19 18:46
Worklog Time Spent: 10m 
  Work Description: je-ik commented on pull request #9190: [BEAM-7520] Fix 
timer firing order in DirectRunner
URL: https://github.com/apache/beam/pull/9190#discussion_r314028231
 
 

 ##
 File path: 
runners/direct-java/src/main/java/org/apache/beam/runners/direct/StatefulParDoEvaluatorFactory.java
 ##
 @@ -244,27 +249,40 @@ public void 
processElement(WindowedValue>> gbkRes
 delegateEvaluator.processElement(windowedValue);
   }
 
+  Instant lastFired = null;
   for (TimerData timer : gbkResult.getValue().timersIterable()) {
 checkState(
 timer.getNamespace() instanceof WindowNamespace,
 "Expected Timer %s to be in a %s, but got %s",
 timer,
 WindowNamespace.class.getSimpleName(),
 timer.getNamespace().getClass().getName());
-WindowNamespace windowNamespace = (WindowNamespace) 
timer.getNamespace();
-BoundedWindow timerWindow = windowNamespace.getWindow();
-delegateEvaluator.onTimer(timer, timerWindow);
+checkState(
+lastFired == null || !lastFired.isAfter(timer.getTimestamp()),
+"lastFired was %s, current %s",
+lastFired,
+timer.getTimestamp());
+if (lastFired != null && lastFired.isBefore(timer.getTimestamp())) {
+  pushedBackTimers.add(timer);
+} else {
+  lastFired = timer.getTimestamp();
+  WindowNamespace windowNamespace = (WindowNamespace) 
timer.getNamespace();
+  BoundedWindow timerWindow = windowNamespace.getWindow();
+  delegateEvaluator.onTimer(timer, timerWindow);
 
 Review comment:
   I will add the test and check if there are any issues. Regarding the logic 
you described: generally, yes, that would be the ideal approach. It would also 
solve some other issues that still remain even after this PR, but it would 
require a larger refactoring, because there is currently no simple (or at 
least no simple and apparent to me) way of knowing whether an `@OnTimer` 
method actually set up any new timer. The `delegateEvaluator` would have to 
signal that, which boils down to `PushbackSideInputDoFnRunner.onTimer` having 
to return the new timers. That is an API change in core, which I wanted to 
avoid. But maybe I've overlooked some other way.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
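The pushedBackTimers logic in the diff above fires a batch of timers only while their timestamps do not advance past the first timestamp fired, deferring the rest to a later pass so that timers set in the meantime can be interleaved. A Python sketch of that gating rule (names are illustrative, not the DirectRunner API):

```python
def fire_in_order(timers, fire):
    """Fire a sorted batch of (timestamp, name) timers, pushing back any timer
    whose timestamp is strictly later than the last one fired, so a later
    extraction pass can interleave timers set while this batch was firing.
    """
    pushed_back = []
    last_fired = None
    for ts, name in timers:
        # Mirrors the checkState in the diff: the batch must be sorted.
        assert last_fired is None or last_fired <= ts, "batch must be sorted"
        if last_fired is not None and last_fired < ts:
            pushed_back.append((ts, name))
        else:
            last_fired = ts
            fire(ts, name)
    return pushed_back

fired = []
remaining = fire_in_order(
    [(10, "timerB"), (10, "timerB2"), (100, "timerA")],
    lambda ts, name: fired.append(name))
print(fired)      # ['timerB', 'timerB2'] -- only the earliest timestamp fires
print(remaining)  # [(100, 'timerA')] -- pushed back for a later pass
```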


Issue Time Tracking
---

Worklog Id: (was: 294928)
Time Spent: 4h 40m  (was: 4.5h)

> DirectRunner timers are not strictly time ordered
> -
>
> Key: BEAM-7520
> URL: https://issues.apache.org/jira/browse/BEAM-7520
> Project: Beam
>  Issue Type: Bug
>  Components: runner-direct
>Affects Versions: 2.13.0
>Reporter: Jan Lukavský
>Assignee: Jan Lukavský
>Priority: Major
>  Time Spent: 4h 40m
>  Remaining Estimate: 0h
>
> Let's suppose we have the following situation:
>  - stateful ParDo with two timers - timerA and timerB
>  - timerA is set for window.maxTimestamp() + 1
>  - timerB is set anywhere between  timerB.timestamp
>  - input watermark moves to BoundedWindow.TIMESTAMP_MAX_VALUE
> Then the order of timers is as follows (correct):
>  - timerB
>  - timerA
> But, if timerB sets another timer (say for timerB.timestamp + 1), then the 
> order of timers will be:
>  - timerB (timerB.timestamp)
>  - timerA (BoundedWindow.TIMESTAMP_MAX_VALUE)
>  - timerB (timerB.timestamp + 1)
> Which is not ordered by timestamp. The reason for this is that when the input 
> watermark update is evaluated, the WatermarkManager.extractFiredTimers() will 
> produce both timerA and timerB. That would be correct, but when timerB sets 
> another timer, that breaks this.
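
The misordering described above can be reproduced in miniature: if fired timers 
are drained into a fixed bundle up front, a timer set during firing slips behind 
a later one. A minimal, hypothetical sketch in plain Java (no Beam types) showing 
that re-consulting a priority queue after each firing preserves timestamp order:

```java
import java.util.Comparator;
import java.util.PriorityQueue;

public class TimerOrderSketch {
  // Hypothetical stand-in for a Beam timer: a name plus a firing timestamp.
  static final class Timer {
    final String name;
    final long ts;
    Timer(String name, long ts) { this.name = name; this.ts = ts; }
  }

  public static void main(String[] args) {
    PriorityQueue<Timer> queue =
        new PriorityQueue<>(Comparator.comparingLong((Timer t) -> t.ts));
    queue.add(new Timer("timerB", 100));            // fires mid-window
    queue.add(new Timer("timerA", Long.MAX_VALUE)); // window.maxTimestamp() + 1

    StringBuilder order = new StringBuilder();
    while (!queue.isEmpty()) {
      Timer t = queue.poll();
      order.append(t.name).append(' ');
      // timerB sets a follow-up timer; because the queue is re-consulted after
      // every firing (instead of draining a pre-extracted bundle), the new
      // timer still fires before timerA.
      if (t.name.equals("timerB")) {
        queue.add(new Timer("timerB2", t.ts + 1));
      }
    }
    System.out.println(order.toString().trim()); // timerB timerB2 timerA
  }
}
```

Draining both original timers first (the pre-extracted bundle model) would 
instead print timerB, timerA, timerB2 - the incorrect order in the report.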



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7520) DirectRunner timers are not strictly time ordered

2019-08-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7520?focusedWorklogId=294924&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-294924
 ]

ASF GitHub Bot logged work on BEAM-7520:


Author: ASF GitHub Bot
Created on: 14/Aug/19 18:41
Start Date: 14/Aug/19 18:41
Worklog Time Spent: 10m 
  Work Description: je-ik commented on pull request #9190: [BEAM-7520] Fix 
timer firing order in DirectRunner
URL: https://github.com/apache/beam/pull/9190#discussion_r314025873
 
 

 ##
 File path: 
runners/direct-java/src/main/java/org/apache/beam/runners/direct/StatefulParDoEvaluatorFactory.java
 ##
 @@ -244,27 +249,40 @@ public void 
processElement(WindowedValue>> gbkRes
 delegateEvaluator.processElement(windowedValue);
   }
 
+  Instant lastFired = null;
 
 Review comment:
   :+1:
 



Issue Time Tracking
---

Worklog Id: (was: 294924)
Time Spent: 4.5h  (was: 4h 20m)

> DirectRunner timers are not strictly time ordered
> -
>
> Key: BEAM-7520
> URL: https://issues.apache.org/jira/browse/BEAM-7520
> Project: Beam
>  Issue Type: Bug
>  Components: runner-direct
>Affects Versions: 2.13.0
>Reporter: Jan Lukavský
>Assignee: Jan Lukavský
>Priority: Major
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> Let's suppose we have the following situation:
>  - stateful ParDo with two timers - timerA and timerB
>  - timerA is set for window.maxTimestamp() + 1
>  - timerB is set anywhere between  timerB.timestamp
>  - input watermark moves to BoundedWindow.TIMESTAMP_MAX_VALUE
> Then the order of timers is as follows (correct):
>  - timerB
>  - timerA
> But, if timerB sets another timer (say for timerB.timestamp + 1), then the 
> order of timers will be:
>  - timerB (timerB.timestamp)
>  - timerA (BoundedWindow.TIMESTAMP_MAX_VALUE)
>  - timerB (timerB.timestamp + 1)
> Which is not ordered by timestamp. The reason for this is that when the input 
> watermark update is evaluated, the WatermarkManager.extractFiredTimers() will 
> produce both timerA and timerB. That would be correct, but when timerB sets 
> another timer, that breaks this.





[jira] [Work logged] (BEAM-7520) DirectRunner timers are not strictly time ordered

2019-08-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7520?focusedWorklogId=294923&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-294923
 ]

ASF GitHub Bot logged work on BEAM-7520:


Author: ASF GitHub Bot
Created on: 14/Aug/19 18:40
Start Date: 14/Aug/19 18:40
Worklog Time Spent: 10m 
  Work Description: je-ik commented on pull request #9190: [BEAM-7520] Fix 
timer firing order in DirectRunner
URL: https://github.com/apache/beam/pull/9190#discussion_r314025524
 
 

 ##
 File path: 
runners/direct-java/src/main/java/org/apache/beam/runners/direct/WatermarkCallbackExecutor.java
 ##
 @@ -116,14 +119,30 @@ public void callOnWindowExpiration(
* Schedule all pending callbacks that must have produced output by the time 
of the provided
* watermark.
*/
-  public void fireForWatermark(AppliedPTransform step, Instant 
watermark) {
+  public void fireForWatermark(AppliedPTransform step, Instant 
watermark)
+  throws InterruptedException {
 PriorityQueue callbackQueue = callbacks.get(step);
 if (callbackQueue == null) {
   return;
 }
 synchronized (callbackQueue) {
+  List toFire = new ArrayList<>();
   while (!callbackQueue.isEmpty() && 
callbackQueue.peek().shouldFire(watermark)) {
-executor.execute(callbackQueue.poll().getCallback());
+toFire.add(callbackQueue.poll().getCallback());
+  }
+  if (!toFire.isEmpty()) {
+CountDownLatch latch = new CountDownLatch(toFire.size());
 
 Review comment:
   If I understand it correctly, then this is unfortunately not currently 
possible, because the `executor` is not an `ExecutorService`, but simply an 
`Executor`, whose `execute` returns `void` instead of a `Future`. I cannot 
simply change it to `ExecutorService`, because `EvaluationContext` passes in 
`MoreExecutors.directExecutor()`, which is not an `ExecutorService`.
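
As a side note to the `Executor` vs `ExecutorService` constraint above: 
`CompletableFuture.runAsync(Runnable, Executor)` accepts a plain `Executor` 
(including a direct executor), so the futures-based style the reviewer suggested 
can be sketched without changing the executor's type. A minimal, hypothetical 
sketch, not the actual `WatermarkCallbackExecutor` code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicInteger;

public class FireAndWaitSketch {
  public static void main(String[] args) {
    // Stand-in for MoreExecutors.directExecutor(): runs tasks on the calling thread.
    Executor direct = Runnable::run;

    AtomicInteger firedCount = new AtomicInteger();
    List<Runnable> toFire = new ArrayList<>();
    for (int i = 0; i < 3; i++) {
      toFire.add(firedCount::incrementAndGet);
    }

    // Instead of counting down a latch inside each callback, collect a future
    // per callback and join on all of them once.
    List<CompletableFuture<Void>> futures = new ArrayList<>();
    for (Runnable callback : toFire) {
      futures.add(CompletableFuture.runAsync(callback, direct));
    }
    CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();

    System.out.println(firedCount.get()); // 3
  }
}
```

This keeps the waiting logic out of the callbacks themselves; whether it fits 
the DirectRunner's threading assumptions is a separate question.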
 



Issue Time Tracking
---

Worklog Id: (was: 294923)
Time Spent: 4h 20m  (was: 4h 10m)

> DirectRunner timers are not strictly time ordered
> -
>
> Key: BEAM-7520
> URL: https://issues.apache.org/jira/browse/BEAM-7520
> Project: Beam
>  Issue Type: Bug
>  Components: runner-direct
>Affects Versions: 2.13.0
>Reporter: Jan Lukavský
>Assignee: Jan Lukavský
>Priority: Major
>  Time Spent: 4h 20m
>  Remaining Estimate: 0h
>
> Let's suppose we have the following situation:
>  - stateful ParDo with two timers - timerA and timerB
>  - timerA is set for window.maxTimestamp() + 1
>  - timerB is set anywhere between  timerB.timestamp
>  - input watermark moves to BoundedWindow.TIMESTAMP_MAX_VALUE
> Then the order of timers is as follows (correct):
>  - timerB
>  - timerA
> But, if timerB sets another timer (say for timerB.timestamp + 1), then the 
> order of timers will be:
>  - timerB (timerB.timestamp)
>  - timerA (BoundedWindow.TIMESTAMP_MAX_VALUE)
>  - timerB (timerB.timestamp + 1)
> Which is not ordered by timestamp. The reason for this is that when the input 
> watermark update is evaluated, the WatermarkManager.extractFiredTimers() will 
> produce both timerA and timerB. That would be correct, but when timerB sets 
> another timer, that breaks this.





[jira] [Work logged] (BEAM-7972) Portable Python Reshuffle does not work with with windowed pcollection

2019-08-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7972?focusedWorklogId=294915&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-294915
 ]

ASF GitHub Bot logged work on BEAM-7972:


Author: ASF GitHub Bot
Created on: 14/Aug/19 18:22
Start Date: 14/Aug/19 18:22
Worklog Time Spent: 10m 
  Work Description: angoenka commented on issue #9334: [BEAM-7972] Always 
use Global window in reshuffle and then apply wind…
URL: https://github.com/apache/beam/pull/9334#issuecomment-521360870
 
 
   R: @lukecwik 
 



Issue Time Tracking
---

Worklog Id: (was: 294915)
Time Spent: 1h  (was: 50m)

> Portable Python Reshuffle does not work with with windowed pcollection
> --
>
> Key: BEAM-7972
> URL: https://issues.apache.org/jira/browse/BEAM-7972
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Ankur Goenka
>Assignee: Ankur Goenka
>Priority: Major
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Streaming pipeline gets stuck when using Reshuffle with a windowed pcollection.
> The issue happens because the window function gets deserialized on the Java 
> side, which is not possible, so it defaults to the global window function; 
> this results in a window-function mismatch later in the code.





[jira] [Work logged] (BEAM-7973) Python doesn't shut down Flink job server properly

2019-08-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7973?focusedWorklogId=294906&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-294906
 ]

ASF GitHub Bot logged work on BEAM-7973:


Author: ASF GitHub Bot
Created on: 14/Aug/19 18:10
Start Date: 14/Aug/19 18:10
Worklog Time Spent: 10m 
  Work Description: ibzib commented on pull request #9340: [BEAM-7973] py: 
shut down Flink job server automatically
URL: https://github.com/apache/beam/pull/9340
 
 
   R: @robertwb 
   
   Post-Commit Tests Status (on master branch)
   

   
   Lang | SDK | Apex | Dataflow | Flink | Gearpump | Samza | Spark
   --- | --- | --- | --- | --- | --- | --- | ---
   Go | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Go/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Go/lastCompletedBuild/)
 | --- | --- | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Go_VR_Flink/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Go_VR_Flink/lastCompletedBuild/)
 | --- | --- | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Go_VR_Spark/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Go_VR_Spark/lastCompletedBuild/)
   Java | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java/lastCompletedBuild/)
 | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Apex/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Apex/lastCompletedBuild/)
 | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow/lastCompletedBuild/)
 | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink/lastCompletedBuild/)[![Build
 
Status](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Flink_Batch/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Flink_Batch/lastCompletedBuild/)[![Build
 
Status](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Flink_Streaming/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Flink_Streaming/lastCompletedBuild/)
 | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Gearpump/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Gearpump/lastCompletedBuild/)
 | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Samza/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Samza/lastCompletedBuild/)
 | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Spark/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Spark/lastCompletedBuild/)[![Build
 
Status](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Spark_Batch/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Spark_Batch/lastCompletedBuild/)
   Python | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Python2/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Python2/lastCompletedBuild/)[![Build
 
Status](https://builds.apache.org/job/beam_PostCommit_Python35/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Python35/lastCompletedBuild/)[![Build
 
Status](https://builds.apache.org/job/beam_PostCommit_Python36/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Python36/lastCompletedBuild/)[![Build
 
Status](https://builds.apache.org/job/beam_PostCommit_Python37/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Python37/lastCompletedBuild/)
 | --- | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Py_VR_Dataflow/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Py_VR_Dataflow/lastCompletedBuild/)[![Build
 
Status](https://builds.apache.org/job/beam_PostCommit_Py_ValCont/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Py_ValCont/lastCompletedBuild/)
 | [![Build 
Status](https://builds.apache.org/job/beam_PreCommit_Python_PVR_Flink_Cron/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PreCommit_Python_PVR_Flink_Cron/lastCompletedBuild/)
 | --- | --- | [![Build 

[jira] [Updated] (BEAM-7973) Python doesn't shut down Flink job server properly

2019-08-14 Thread Kyle Weaver (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kyle Weaver updated BEAM-7973:
--
Summary: Python doesn't shut down Flink job server properly  (was: Python 
doesn't shut down job server properly)

> Python doesn't shut down Flink job server properly
> --
>
> Key: BEAM-7973
> URL: https://issues.apache.org/jira/browse/BEAM-7973
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-harness
>Reporter: Kyle Weaver
>Priority: Major
>  Labels: portability-flink
>
> Using the new Python FlinkRunner [1], a new job server is created and the job 
> succeeds, but the job server is seemingly not shut down properly when the 
> Python command exits. Specifically, the java -jar command that started the job 
> server is still running in the background, eating up memory.
> Relevant args:
> python ...
>  --runner FlinkRunner \ 
>  --flink_job_server_jar $FLINK_JOB_SERVER_JAR ...
> [1] [https://github.com/apache/beam/pull/9043]





[jira] [Commented] (BEAM-7520) DirectRunner timers are not strictly time ordered

2019-08-14 Thread Kenneth Knowles (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-7520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907502#comment-16907502
 ] 

Kenneth Knowles commented on BEAM-7520:
---

... and that new timer will be committed as part of the result first, but will 
be fired right away in the next bundle. This is incorrect because of the 
example you showed.

> DirectRunner timers are not strictly time ordered
> -
>
> Key: BEAM-7520
> URL: https://issues.apache.org/jira/browse/BEAM-7520
> Project: Beam
>  Issue Type: Bug
>  Components: runner-direct
>Affects Versions: 2.13.0
>Reporter: Jan Lukavský
>Assignee: Jan Lukavský
>Priority: Major
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> Let's suppose we have the following situation:
>  - stateful ParDo with two timers - timerA and timerB
>  - timerA is set for window.maxTimestamp() + 1
>  - timerB is set anywhere between  timerB.timestamp
>  - input watermark moves to BoundedWindow.TIMESTAMP_MAX_VALUE
> Then the order of timers is as follows (correct):
>  - timerB
>  - timerA
> But, if timerB sets another timer (say for timerB.timestamp + 1), then the 
> order of timers will be:
>  - timerB (timerB.timestamp)
>  - timerA (BoundedWindow.TIMESTAMP_MAX_VALUE)
>  - timerB (timerB.timestamp + 1)
> Which is not ordered by timestamp. The reason for this is that when the input 
> watermark update is evaluated, the WatermarkManager.extractFiredTimers() will 
> produce both timerA and timerB. That would be correct, but when timerB sets 
> another timer, that breaks this.





[jira] [Commented] (BEAM-7520) DirectRunner timers are not strictly time ordered

2019-08-14 Thread JIRA


[ 
https://issues.apache.org/jira/browse/BEAM-7520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907497#comment-16907497
 ] 

Jan Lukavský commented on BEAM-7520:


Actually, the issue is that the DirectRunner does the following (asynchronously):
 * update the input watermark
 * extract the timers that should be fired and create an (immutable) bundle 
from them
 * pass this bundle for processing; the processing can result in more timers 
being set

With this model, when the watermark moves, a timer can be fired even though 
some other timer (extracted into the same bundle) might set up a timer for a 
lower timestamp.

> DirectRunner timers are not strictly time ordered
> -
>
> Key: BEAM-7520
> URL: https://issues.apache.org/jira/browse/BEAM-7520
> Project: Beam
>  Issue Type: Bug
>  Components: runner-direct
>Affects Versions: 2.13.0
>Reporter: Jan Lukavský
>Assignee: Jan Lukavský
>Priority: Major
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> Let's suppose we have the following situation:
>  - stateful ParDo with two timers - timerA and timerB
>  - timerA is set for window.maxTimestamp() + 1
>  - timerB is set anywhere between  timerB.timestamp
>  - input watermark moves to BoundedWindow.TIMESTAMP_MAX_VALUE
> Then the order of timers is as follows (correct):
>  - timerB
>  - timerA
> But, if timerB sets another timer (say for timerB.timestamp + 1), then the 
> order of timers will be:
>  - timerB (timerB.timestamp)
>  - timerA (BoundedWindow.TIMESTAMP_MAX_VALUE)
>  - timerB (timerB.timestamp + 1)
> Which is not ordered by timestamp. The reason for this is that when the input 
> watermark update is evaluated, the WatermarkManager.extractFiredTimers() will 
> produce both timerA and timerB. That would be correct, but when timerB sets 
> another timer, that breaks this.





[jira] [Work logged] (BEAM-7711) Support DATETIME as a logical type in BeamSQL

2019-08-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7711?focusedWorklogId=294894&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-294894
 ]

ASF GitHub Bot logged work on BEAM-7711:


Author: ASF GitHub Bot
Created on: 14/Aug/19 17:43
Start Date: 14/Aug/19 17:43
Worklog Time Spent: 10m 
  Work Description: amaliujia commented on pull request #8994: [BEAM-7711] 
Add DATETIME as a logical type in BeamSQL
URL: https://github.com/apache/beam/pull/8994#discussion_r31432
 
 

 ##
 File path: 
sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/utils/CalciteUtils.java
 ##
 @@ -154,6 +163,7 @@ public static boolean isStringType(FieldType fieldType) {
   .put(TIME_WITH_LOCAL_TZ, SqlTypeName.TIME_WITH_LOCAL_TIME_ZONE)
   .put(TIMESTAMP, SqlTypeName.TIMESTAMP)
   .put(TIMESTAMP_WITH_LOCAL_TZ, 
SqlTypeName.TIMESTAMP_WITH_LOCAL_TIME_ZONE)
+  .put(DATETIME, SqlTypeName.TIMESTAMP)
 
 Review comment:
   The problem is that Beam ZetaSQL will only see the Calcite schema. If we 
treat `DATETIME` as Calcite's `TIMESTAMP` (thus mixing `DATETIME` with 
`TIMESTAMP WITH TIME ZONE`), Beam ZetaSQL loses the information of the 
`DATETIME` type in its plans.
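
The information loss described above can be illustrated without Calcite itself: 
once two distinct source types map onto one target name, the inverse mapping is 
ambiguous. A minimal sketch using plain strings (the type names mirror the 
proposal, but this is not the real `CalciteUtils` API):

```java
import java.util.HashMap;
import java.util.Map;

public class LossyTypeMappingSketch {
  public static void main(String[] args) {
    // Hypothetical Beam-to-Calcite type-name mapping, mirroring the proposal
    // to map the DATETIME logical type onto SqlTypeName.TIMESTAMP.
    Map<String, String> beamToCalcite = new HashMap<>();
    beamToCalcite.put("TIMESTAMP", "TIMESTAMP");
    beamToCalcite.put("DATETIME", "TIMESTAMP"); // proposed mapping

    // Two Beam types collapse onto one Calcite name, so a consumer that sees
    // only the Calcite schema (e.g. Beam ZetaSQL) cannot recover DATETIME.
    long distinctTargets = beamToCalcite.values().stream().distinct().count();
    System.out.println(beamToCalcite.size() > distinctTargets); // true: lossy
  }
}
```

Avoiding the loss would require either a distinct target type or carrying the 
logical-type tag alongside the Calcite schema.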
 



Issue Time Tracking
---

Worklog Id: (was: 294894)
Time Spent: 2.5h  (was: 2h 20m)

> Support DATETIME as a logical type in BeamSQL
> -
>
> Key: BEAM-7711
> URL: https://issues.apache.org/jira/browse/BEAM-7711
> Project: Beam
>  Issue Type: New Feature
>  Components: dsl-sql
>Reporter: Rui Wang
>Priority: Major
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> DATETIME as a type represents a year, month, day, hour, minute, second, and 
> subsecond (millis).
> It ranges from 0001-01-01 00:00:00 to 9999-12-31 23:59:59.999.





[jira] [Work logged] (BEAM-7711) Support DATETIME as a logical type in BeamSQL

2019-08-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7711?focusedWorklogId=294893&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-294893
 ]

ASF GitHub Bot logged work on BEAM-7711:


Author: ASF GitHub Bot
Created on: 14/Aug/19 17:42
Start Date: 14/Aug/19 17:42
Worklog Time Spent: 10m 
  Work Description: amaliujia commented on pull request #8994: [BEAM-7711] 
Add DATETIME as a logical type in BeamSQL
URL: https://github.com/apache/beam/pull/8994#discussion_r31432
 
 

 ##
 File path: 
sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/utils/CalciteUtils.java
 ##
 @@ -154,6 +163,7 @@ public static boolean isStringType(FieldType fieldType) {
   .put(TIME_WITH_LOCAL_TZ, SqlTypeName.TIME_WITH_LOCAL_TIME_ZONE)
   .put(TIMESTAMP, SqlTypeName.TIMESTAMP)
   .put(TIMESTAMP_WITH_LOCAL_TZ, 
SqlTypeName.TIMESTAMP_WITH_LOCAL_TIME_ZONE)
+  .put(DATETIME, SqlTypeName.TIMESTAMP)
 
 Review comment:
   The problem is that Beam ZetaSQL will only see the Calcite schema. If we 
treat `DATETIME` as Calcite's `TIMESTAMP` (thus mixing `DATETIME` with 
`TIMESTAMP WITH TIME ZONE`), Beam ZetaSQL loses the information of the 
`DATETIME` type in its plans.
 



Issue Time Tracking
---

Worklog Id: (was: 294893)
Time Spent: 2h 20m  (was: 2h 10m)

> Support DATETIME as a logical type in BeamSQL
> -
>
> Key: BEAM-7711
> URL: https://issues.apache.org/jira/browse/BEAM-7711
> Project: Beam
>  Issue Type: New Feature
>  Components: dsl-sql
>Reporter: Rui Wang
>Priority: Major
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> DATETIME as a type represents a year, month, day, hour, minute, second, and 
> subsecond (millis).
> It ranges from 0001-01-01 00:00:00 to 9999-12-31 23:59:59.999.





[jira] [Work logged] (BEAM-7711) Support DATETIME as a logical type in BeamSQL

2019-08-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7711?focusedWorklogId=294892&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-294892
 ]

ASF GitHub Bot logged work on BEAM-7711:


Author: ASF GitHub Bot
Created on: 14/Aug/19 17:42
Start Date: 14/Aug/19 17:42
Worklog Time Spent: 10m 
  Work Description: amaliujia commented on pull request #8994: [BEAM-7711] 
Add DATETIME as a logical type in BeamSQL
URL: https://github.com/apache/beam/pull/8994#discussion_r31432
 
 

 ##
 File path: 
sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/utils/CalciteUtils.java
 ##
 @@ -154,6 +163,7 @@ public static boolean isStringType(FieldType fieldType) {
   .put(TIME_WITH_LOCAL_TZ, SqlTypeName.TIME_WITH_LOCAL_TIME_ZONE)
   .put(TIMESTAMP, SqlTypeName.TIMESTAMP)
   .put(TIMESTAMP_WITH_LOCAL_TZ, 
SqlTypeName.TIMESTAMP_WITH_LOCAL_TIME_ZONE)
+  .put(DATETIME, SqlTypeName.TIMESTAMP)
 
 Review comment:
   The problem is that Beam ZetaSQL will only see the Calcite schema. If we 
treat `DATETIME` as Calcite's `TIMESTAMP` (thus mixing `DATETIME` with 
`TIMESTAMP WITH TIME ZONE`), Beam ZetaSQL loses the information of the 
`DATETIME` type in its plans.
 



Issue Time Tracking
---

Worklog Id: (was: 294892)
Time Spent: 2h 10m  (was: 2h)

> Support DATETIME as a logical type in BeamSQL
> -
>
> Key: BEAM-7711
> URL: https://issues.apache.org/jira/browse/BEAM-7711
> Project: Beam
>  Issue Type: New Feature
>  Components: dsl-sql
>Reporter: Rui Wang
>Priority: Major
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> DATETIME as a type represents a year, month, day, hour, minute, second, and 
> subsecond (millis).
> It ranges from 0001-01-01 00:00:00 to 9999-12-31 23:59:59.999.





[jira] [Work logged] (BEAM-7711) Support DATETIME as a logical type in BeamSQL

2019-08-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7711?focusedWorklogId=294890&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-294890
 ]

ASF GitHub Bot logged work on BEAM-7711:


Author: ASF GitHub Bot
Created on: 14/Aug/19 17:38
Start Date: 14/Aug/19 17:38
Worklog Time Spent: 10m 
  Work Description: amaliujia commented on pull request #8994: [BEAM-7711] 
Add DATETIME as a logical type in BeamSQL
URL: https://github.com/apache/beam/pull/8994#discussion_r313997918
 
 

 ##
 File path: 
sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/utils/CalciteUtils.java
 ##
 @@ -154,6 +163,7 @@ public static boolean isStringType(FieldType fieldType) {
   .put(TIME_WITH_LOCAL_TZ, SqlTypeName.TIME_WITH_LOCAL_TIME_ZONE)
   .put(TIMESTAMP, SqlTypeName.TIMESTAMP)
   .put(TIMESTAMP_WITH_LOCAL_TZ, 
SqlTypeName.TIMESTAMP_WITH_LOCAL_TIME_ZONE)
+  .put(DATETIME, SqlTypeName.TIMESTAMP)
 
 Review comment:
   I cannot find a precise definition of `TIMESTAMP WITHOUT TIME ZONE` in 
Calcite, so I don't really know the difference.
   
   However, Calcite maps both `TIMESTAMP WITHOUT TIME ZONE` and `TIMESTAMP WITH 
TIME ZONE` to `TIMESTAMP`.
 



Issue Time Tracking
---

Worklog Id: (was: 294890)
Time Spent: 1h 50m  (was: 1h 40m)

> Support DATETIME as a logical type in BeamSQL
> -
>
> Key: BEAM-7711
> URL: https://issues.apache.org/jira/browse/BEAM-7711
> Project: Beam
>  Issue Type: New Feature
>  Components: dsl-sql
>Reporter: Rui Wang
>Priority: Major
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> DATETIME as a type represents a year, month, day, hour, minute, second, and 
> subsecond (millis).
> It ranges from 0001-01-01 00:00:00 to 9999-12-31 23:59:59.999.





[jira] [Work logged] (BEAM-7711) Support DATETIME as a logical type in BeamSQL

2019-08-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7711?focusedWorklogId=294891&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-294891
 ]

ASF GitHub Bot logged work on BEAM-7711:


Author: ASF GitHub Bot
Created on: 14/Aug/19 17:38
Start Date: 14/Aug/19 17:38
Worklog Time Spent: 10m 
  Work Description: amaliujia commented on pull request #8994: [BEAM-7711] 
Add DATETIME as a logical type in BeamSQL
URL: https://github.com/apache/beam/pull/8994#discussion_r313997918
 
 

 ##
 File path: 
sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/utils/CalciteUtils.java
 ##
 @@ -154,6 +163,7 @@ public static boolean isStringType(FieldType fieldType) {
   .put(TIME_WITH_LOCAL_TZ, SqlTypeName.TIME_WITH_LOCAL_TIME_ZONE)
   .put(TIMESTAMP, SqlTypeName.TIMESTAMP)
   .put(TIMESTAMP_WITH_LOCAL_TZ, 
SqlTypeName.TIMESTAMP_WITH_LOCAL_TIME_ZONE)
+  .put(DATETIME, SqlTypeName.TIMESTAMP)
 
 Review comment:
   I cannot find a precise definition of `TIMESTAMP WITHOUT TIME ZONE` in 
Calcite, so I don't really know the difference.
   
   However, Calcite maps both `TIMESTAMP WITHOUT TIME ZONE` and `TIMESTAMP WITH 
TIME ZONE` to `TIMESTAMP`.
 



Issue Time Tracking
---

Worklog Id: (was: 294891)
Time Spent: 2h  (was: 1h 50m)

> Support DATETIME as a logical type in BeamSQL
> -
>
> Key: BEAM-7711
> URL: https://issues.apache.org/jira/browse/BEAM-7711
> Project: Beam
>  Issue Type: New Feature
>  Components: dsl-sql
>Reporter: Rui Wang
>Priority: Major
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> DATETIME as a type represents a year, month, day, hour, minute, second, and 
> subsecond (millis).
> It ranges from 0001-01-01 00:00:00 to 9999-12-31 23:59:59.999.





[jira] [Work logged] (BEAM-7711) Support DATETIME as a logical type in BeamSQL

2019-08-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7711?focusedWorklogId=294889&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-294889
 ]

ASF GitHub Bot logged work on BEAM-7711:


Author: ASF GitHub Bot
Created on: 14/Aug/19 17:38
Start Date: 14/Aug/19 17:38
Worklog Time Spent: 10m 
  Work Description: amaliujia commented on pull request #8994: [BEAM-7711] 
Add DATETIME as a logical type in BeamSQL
URL: https://github.com/apache/beam/pull/8994#discussion_r313997918
 
 

 ##
 File path: 
sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/utils/CalciteUtils.java
 ##
 @@ -154,6 +163,7 @@ public static boolean isStringType(FieldType fieldType) {
   .put(TIME_WITH_LOCAL_TZ, SqlTypeName.TIME_WITH_LOCAL_TIME_ZONE)
   .put(TIMESTAMP, SqlTypeName.TIMESTAMP)
   .put(TIMESTAMP_WITH_LOCAL_TZ, 
SqlTypeName.TIMESTAMP_WITH_LOCAL_TIME_ZONE)
+  .put(DATETIME, SqlTypeName.TIMESTAMP)
 
 Review comment:
   I cannot find a precise definition of `TIMESTAMP WITHOUT TIME ZONE` in 
Calcite, so I don't really know the difference.
   
   However, Calcite maps both `TIMESTAMP WITHOUT TIME ZONE` and `TIMESTAMP WITH 
TIME ZONE` to `TIMESTAMP`.
 



Issue Time Tracking
---

Worklog Id: (was: 294889)
Time Spent: 1h 40m  (was: 1.5h)

> Support DATETIME as a logical type in BeamSQL
> -
>
> Key: BEAM-7711
> URL: https://issues.apache.org/jira/browse/BEAM-7711
> Project: Beam
>  Issue Type: New Feature
>  Components: dsl-sql
>Reporter: Rui Wang
>Priority: Major
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> DATETIME as a type represents a year, month, day, hour, minute, second, and 
> subsecond (millis).
> It ranges from 0001-01-01 00:00:00 to 9999-12-31 23:59:59.999.





[jira] [Work logged] (BEAM-7520) DirectRunner timers are not strictly time ordered

2019-08-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7520?focusedWorklogId=294861&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-294861
 ]

ASF GitHub Bot logged work on BEAM-7520:


Author: ASF GitHub Bot
Created on: 14/Aug/19 16:49
Start Date: 14/Aug/19 16:49
Worklog Time Spent: 10m 
  Work Description: kennknowles commented on pull request #9190: 
[BEAM-7520] Fix timer firing order in DirectRunner
URL: https://github.com/apache/beam/pull/9190#discussion_r313975188
 
 

 ##
 File path: 
runners/direct-java/src/main/java/org/apache/beam/runners/direct/WatermarkCallbackExecutor.java
 ##
 @@ -116,14 +119,30 @@ public void callOnWindowExpiration(
* Schedule all pending callbacks that must have produced output by the time 
of the provided
* watermark.
*/
-  public void fireForWatermark(AppliedPTransform<?, ?, ?> step, Instant watermark) {
+  public void fireForWatermark(AppliedPTransform<?, ?, ?> step, Instant watermark)
+      throws InterruptedException {
 PriorityQueue<WatermarkCallback> callbackQueue = callbacks.get(step);
 if (callbackQueue == null) {
   return;
 }
 synchronized (callbackQueue) {
+  List<Runnable> toFire = new ArrayList<>();
   while (!callbackQueue.isEmpty() && 
callbackQueue.peek().shouldFire(watermark)) {
-executor.execute(callbackQueue.poll().getCallback());
+toFire.add(callbackQueue.poll().getCallback());
+  }
+  if (!toFire.isEmpty()) {
+CountDownLatch latch = new CountDownLatch(toFire.size());
 
 Review comment:
   Instead of managing the latch yourself, you could gather a list of futures 
and wait on it. It would be a bit higher-level programming style, and would not 
depend on capturing the latch in the closure, etc.
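The latch-vs-futures distinction above can be sketched generically (in Python's `concurrent.futures`, since the pattern is language-agnostic; the PR itself is Java): submit each callback, keep the returned futures, and wait on them collectively instead of sizing a `CountDownLatch` and counting down inside every callback closure.

```python
import concurrent.futures

def fire_callbacks(executor, callbacks):
    # Submit every callback and keep the returned futures...
    futures = [executor.submit(cb) for cb in callbacks]
    # ...then wait on the collection; exceptions surface through .result().
    for f in concurrent.futures.as_completed(futures):
        f.result()

results = []
with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
    fire_callbacks(pool, [lambda i=i: results.append(i) for i in range(3)])
print(sorted(results))  # -> [0, 1, 2]
```

Nothing in the callbacks needs to know about the synchronization object, which is the point of the suggestion.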
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 294861)
Time Spent: 4h 10m  (was: 4h)

> DirectRunner timers are not strictly time ordered
> -
>
> Key: BEAM-7520
> URL: https://issues.apache.org/jira/browse/BEAM-7520
> Project: Beam
>  Issue Type: Bug
>  Components: runner-direct
>Affects Versions: 2.13.0
>Reporter: Jan Lukavský
>Assignee: Jan Lukavský
>Priority: Major
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> Let's suppose we have the following situation:
>  - stateful ParDo with two timers - timerA and timerB
>  - timerA is set for window.maxTimestamp() + 1
>  - timerB is set anywhere between  timerB.timestamp
>  - input watermark moves to BoundedWindow.TIMESTAMP_MAX_VALUE
> Then the order of timers is as follows (correct):
>  - timerB
>  - timerA
> But, if timerB sets another timer (say for timerB.timestamp + 1), then the 
> order of timers will be:
>  - timerB (timerB.timestamp)
>  - timerA (BoundedWindow.TIMESTAMP_MAX_VALUE)
>  - timerB (timerB.timestamp + 1)
> Which is not ordered by timestamp. The reason for this is that when the input 
> watermark update is evaluated, the WatermarkManager.extractFiredTimers() will 
> produce both timerA and timerB. That would be correct, but when timerB sets 
> another timer, that breaks this.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7520) DirectRunner timers are not strictly time ordered

2019-08-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7520?focusedWorklogId=294862=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-294862
 ]

ASF GitHub Bot logged work on BEAM-7520:


Author: ASF GitHub Bot
Created on: 14/Aug/19 16:49
Start Date: 14/Aug/19 16:49
Worklog Time Spent: 10m 
  Work Description: kennknowles commented on pull request #9190: 
[BEAM-7520] Fix timer firing order in DirectRunner
URL: https://github.com/apache/beam/pull/9190#discussion_r313975293
 
 

 ##
 File path: 
runners/direct-java/src/main/java/org/apache/beam/runners/direct/StatefulParDoEvaluatorFactory.java
 ##
 @@ -244,27 +249,40 @@ public void 
processElement(WindowedValue>> gbkRes
 delegateEvaluator.processElement(windowedValue);
   }
 
+  Instant lastFired = null;
 
 Review comment:
   `@Nullable`
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 294862)

> DirectRunner timers are not strictly time ordered
> -
>
> Key: BEAM-7520
> URL: https://issues.apache.org/jira/browse/BEAM-7520
> Project: Beam
>  Issue Type: Bug
>  Components: runner-direct
>Affects Versions: 2.13.0
>Reporter: Jan Lukavský
>Assignee: Jan Lukavský
>Priority: Major
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> Let's suppose we have the following situation:
>  - stateful ParDo with two timers - timerA and timerB
>  - timerA is set for window.maxTimestamp() + 1
>  - timerB is set anywhere between  timerB.timestamp
>  - input watermark moves to BoundedWindow.TIMESTAMP_MAX_VALUE
> Then the order of timers is as follows (correct):
>  - timerB
>  - timerA
> But, if timerB sets another timer (say for timerB.timestamp + 1), then the 
> order of timers will be:
>  - timerB (timerB.timestamp)
>  - timerA (BoundedWindow.TIMESTAMP_MAX_VALUE)
>  - timerB (timerB.timestamp + 1)
> Which is not ordered by timestamp. The reason for this is that when the input 
> watermark update is evaluated, the WatermarkManager.extractFiredTimers() will 
> produce both timerA and timerB. That would be correct, but when timerB sets 
> another timer, that breaks this.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7520) DirectRunner timers are not strictly time ordered

2019-08-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7520?focusedWorklogId=294863=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-294863
 ]

ASF GitHub Bot logged work on BEAM-7520:


Author: ASF GitHub Bot
Created on: 14/Aug/19 16:49
Start Date: 14/Aug/19 16:49
Worklog Time Spent: 10m 
  Work Description: kennknowles commented on pull request #9190: 
[BEAM-7520] Fix timer firing order in DirectRunner
URL: https://github.com/apache/beam/pull/9190#discussion_r313977188
 
 

 ##
 File path: 
runners/direct-java/src/main/java/org/apache/beam/runners/direct/StatefulParDoEvaluatorFactory.java
 ##
 @@ -244,27 +249,40 @@ public void 
processElement(WindowedValue>> gbkRes
 delegateEvaluator.processElement(windowedValue);
   }
 
+  Instant lastFired = null;
   for (TimerData timer : gbkResult.getValue().timersIterable()) {
 checkState(
 timer.getNamespace() instanceof WindowNamespace,
 "Expected Timer %s to be in a %s, but got %s",
 timer,
 WindowNamespace.class.getSimpleName(),
 timer.getNamespace().getClass().getName());
-WindowNamespace windowNamespace = (WindowNamespace) 
timer.getNamespace();
-BoundedWindow timerWindow = windowNamespace.getWindow();
-delegateEvaluator.onTimer(timer, timerWindow);
+checkState(
+lastFired == null || !lastFired.isAfter(timer.getTimestamp()),
+"lastFired was %s, current %s",
+lastFired,
+timer.getTimestamp());
+if (lastFired != null && lastFired.isBefore(timer.getTimestamp())) {
+  pushedBackTimers.add(timer);
+} else {
+  lastFired = timer.getTimestamp();
+  WindowNamespace windowNamespace = (WindowNamespace) 
timer.getNamespace();
+  BoundedWindow timerWindow = windowNamespace.getWindow();
+  delegateEvaluator.onTimer(timer, timerWindow);
 
 Review comment:
   I'm reading this really quickly, but I think it works only for loops of the 
same timer, not the general case. It also leaves the "should fire" check implicit 
in the fact that the timer already fired, instead of checking it explicitly. I think 
if you have a unit test where two timer callbacks set each other to "already 
firing" times you will uncover some issues.
   
   I think there's a simple logic you might use:
   
   1. Add the fired timers in the incoming bundle to a priority queue.
   2. While any timers remain to fire, call the first callback.
      a. For any timer set by the `@OnTimer` callback, check whether it should 
fire. If yes, add it to the fired-timers priority queue. If no, add it to the 
committed result.
   
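The loop sketched in the review comment above can be modeled with a small stdlib sketch (an assumed simplification, not DirectRunner code: timers are `(timestamp, id)` pairs, and a timer "should fire" when its timestamp is <= the watermark):

```python
import heapq

def fire_timers(fired, watermark, callbacks):
    """Fire timers in timestamp order, handling timers set while firing.

    fired     -- (timestamp, timer_id) pairs already eligible to fire
    watermark -- a timer "should fire" when its timestamp is <= this value
    callbacks -- timer_id -> callable returning newly set (timestamp, timer_id) pairs
    """
    queue = list(fired)
    heapq.heapify(queue)                       # 1. fired timers into a priority queue
    order, pushed_back = [], []
    while queue:                               # 2. fire the earliest pending timer
        ts, timer_id = heapq.heappop(queue)
        order.append((ts, timer_id))
        for new_ts, new_id in callbacks.get(timer_id, lambda: [])():
            if new_ts <= watermark:            # 2a. newly set timer already eligible:
                heapq.heappush(queue, (new_ts, new_id))  # back into the queue
            else:                              # otherwise hold it for the committed result
                pushed_back.append((new_ts, new_id))
    return order, pushed_back

# The ticket's scenario: timerB (t=5) sets a new timer at t=6 while timerA
# waits at t=10; the firing order comes out sorted by timestamp.
order, held = fire_timers([(5, "B"), (10, "A")], 10, {"B": lambda: [(6, "B2")]})
print(order)  # -> [(5, 'B'), (6, 'B2'), (10, 'A')]
```

This avoids the inversion reported in the JIRA description, where the late-set timer fired after the end-of-window timer.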
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 294863)

> DirectRunner timers are not strictly time ordered
> -
>
> Key: BEAM-7520
> URL: https://issues.apache.org/jira/browse/BEAM-7520
> Project: Beam
>  Issue Type: Bug
>  Components: runner-direct
>Affects Versions: 2.13.0
>Reporter: Jan Lukavský
>Assignee: Jan Lukavský
>Priority: Major
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> Let's suppose we have the following situation:
>  - stateful ParDo with two timers - timerA and timerB
>  - timerA is set for window.maxTimestamp() + 1
>  - timerB is set anywhere between  timerB.timestamp
>  - input watermark moves to BoundedWindow.TIMESTAMP_MAX_VALUE
> Then the order of timers is as follows (correct):
>  - timerB
>  - timerA
> But, if timerB sets another timer (say for timerB.timestamp + 1), then the 
> order of timers will be:
>  - timerB (timerB.timestamp)
>  - timerA (BoundedWindow.TIMESTAMP_MAX_VALUE)
>  - timerB (timerB.timestamp + 1)
> Which is not ordered by timestamp. The reason for this is that when the input 
> watermark update is evaluated, the WatermarkManager.extractFiredTimers() will 
> produce both timerA and timerB. That would be correct, but when timerB sets 
> another timer, that breaks this.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7520) DirectRunner timers are not strictly time ordered

2019-08-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7520?focusedWorklogId=294852=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-294852
 ]

ASF GitHub Bot logged work on BEAM-7520:


Author: ASF GitHub Bot
Created on: 14/Aug/19 16:40
Start Date: 14/Aug/19 16:40
Worklog Time Spent: 10m 
  Work Description: reuvenlax commented on issue #9190: [BEAM-7520] Fix 
timer firing order in DirectRunner
URL: https://github.com/apache/beam/pull/9190#issuecomment-521323520
 
 
   R: @dpmills 
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 294852)
Time Spent: 4h  (was: 3h 50m)

> DirectRunner timers are not strictly time ordered
> -
>
> Key: BEAM-7520
> URL: https://issues.apache.org/jira/browse/BEAM-7520
> Project: Beam
>  Issue Type: Bug
>  Components: runner-direct
>Affects Versions: 2.13.0
>Reporter: Jan Lukavský
>Assignee: Jan Lukavský
>Priority: Major
>  Time Spent: 4h
>  Remaining Estimate: 0h
>
> Let's suppose we have the following situation:
>  - stateful ParDo with two timers - timerA and timerB
>  - timerA is set for window.maxTimestamp() + 1
>  - timerB is set anywhere between  timerB.timestamp
>  - input watermark moves to BoundedWindow.TIMESTAMP_MAX_VALUE
> Then the order of timers is as follows (correct):
>  - timerB
>  - timerA
> But, if timerB sets another timer (say for timerB.timestamp + 1), then the 
> order of timers will be:
>  - timerB (timerB.timestamp)
>  - timerA (BoundedWindow.TIMESTAMP_MAX_VALUE)
>  - timerB (timerB.timestamp + 1)
> Which is not ordered by timestamp. The reason for this is that when the input 
> watermark update is evaluated, the WatermarkManager.extractFiredTimers() will 
> produce both timerA and timerB. That would be correct, but when timerB sets 
> another timer, that breaks this.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (BEAM-7520) DirectRunner timers are not strictly time ordered

2019-08-14 Thread Kenneth Knowles (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-7520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16907424#comment-16907424
 ] 

Kenneth Knowles commented on BEAM-7520:
---

Nice catch. A priority queue of timers should already work, but the bug is 
probably that timers are not added to the queue until a bundle is committed, 
right?

> DirectRunner timers are not strictly time ordered
> -
>
> Key: BEAM-7520
> URL: https://issues.apache.org/jira/browse/BEAM-7520
> Project: Beam
>  Issue Type: Bug
>  Components: runner-direct
>Affects Versions: 2.13.0
>Reporter: Jan Lukavský
>Assignee: Jan Lukavský
>Priority: Major
>  Time Spent: 4h
>  Remaining Estimate: 0h
>
> Let's suppose we have the following situation:
>  - stateful ParDo with two timers - timerA and timerB
>  - timerA is set for window.maxTimestamp() + 1
>  - timerB is set anywhere between  timerB.timestamp
>  - input watermark moves to BoundedWindow.TIMESTAMP_MAX_VALUE
> Then the order of timers is as follows (correct):
>  - timerB
>  - timerA
> But, if timerB sets another timer (say for timerB.timestamp + 1), then the 
> order of timers will be:
>  - timerB (timerB.timestamp)
>  - timerA (BoundedWindow.TIMESTAMP_MAX_VALUE)
>  - timerB (timerB.timestamp + 1)
> Which is not ordered by timestamp. The reason for this is that when the input 
> watermark update is evaluated, the WatermarkManager.extractFiredTimers() will 
> produce both timerA and timerB. That would be correct, but when timerB sets 
> another timer, that breaks this.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (BEAM-7871) Streaming from PubSub to Firestore doesn't work on Dataflow

2019-08-14 Thread Bruce Arctor (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-7871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16907413#comment-16907413
 ] 

Bruce Arctor commented on BEAM-7871:


[~zdenulo], is this due to the lack of an IO connector for the Firebase API 
(Firestore in native mode)? I do see that there is a Datastore connector, so I was 
wondering whether that works with Firestore in Datastore mode. I haven't tried yet, 
but this is a task I was hoping to accomplish (PubSub -> Beam/Dataflow -> 
Firestore in native mode, i.e. the Firebase API); I am just starting to look into it, 
which led me here.

> Streaming from PubSub to Firestore doesn't work on Dataflow
> ---
>
> Key: BEAM-7871
> URL: https://issues.apache.org/jira/browse/BEAM-7871
> Project: Beam
>  Issue Type: Bug
>  Components: io-py-gcp, runner-dataflow
>Affects Versions: 2.13.0
>Reporter: Zdenko Hrcek
>Priority: Major
>
> I came to the same error as here 
> [https://stackoverflow.com/questions/57059944/python-package-errors-while-running-gcp-dataflow]
>  but I don't see anywhere reported thus I am creating an issue just in case.
> The pipeline is quite simple, reading from PubSub and writing to Firestore.
> Beam version used is 2.13.0, Python 2.7
> With DirectRunner works ok, but on Dataflow it throws the following message:
>  
> {code:java}
> java.util.concurrent.ExecutionException: java.lang.RuntimeException: Error 
> received from SDK harness for instruction -81: Traceback (most recent call 
> last):
>  File 
> "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/worker/sdk_worker.py",
>  line 157, in _execute
>  response = task()
>  File 
> "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/worker/sdk_worker.py",
>  line 190, in 
>  self._execute(lambda: worker.do_instruction(work), work)
>  File 
> "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/worker/sdk_worker.py",
>  line 312, in do_instruction
>  request.instruction_id)
>  File 
> "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/worker/sdk_worker.py",
>  line 331, in process_bundle
>  bundle_processor.process_bundle(instruction_id))
>  File 
> "/usr/local/lib/python2.7/dist-packages/apache_beam/runners/worker/bundle_processor.py",
>  line 538, in process_bundle
>  op.start()
>  File "apache_beam/runners/worker/operations.py", line 554, in 
> apache_beam.runners.worker.operations.DoOperation.start
>  def start(self):
>  File "apache_beam/runners/worker/operations.py", line 555, in 
> apache_beam.runners.worker.operations.DoOperation.start
>  with self.scoped_start_state:
>  File "apache_beam/runners/worker/operations.py", line 557, in 
> apache_beam.runners.worker.operations.DoOperation.start
>  self.dofn_runner.start()
>  File "apache_beam/runners/common.py", line 778, in 
> apache_beam.runners.common.DoFnRunner.start
>  self._invoke_bundle_method(self.do_fn_invoker.invoke_start_bundle)
>  File "apache_beam/runners/common.py", line 775, in 
> apache_beam.runners.common.DoFnRunner._invoke_bundle_method
>  self._reraise_augmented(exn)
>  File "apache_beam/runners/common.py", line 800, in 
> apache_beam.runners.common.DoFnRunner._reraise_augmented
>  raise_with_traceback(new_exn)
>  File "apache_beam/runners/common.py", line 773, in 
> apache_beam.runners.common.DoFnRunner._invoke_bundle_method
>  bundle_method()
>  File "apache_beam/runners/common.py", line 359, in 
> apache_beam.runners.common.DoFnInvoker.invoke_start_bundle
>  def invoke_start_bundle(self):
>  File "apache_beam/runners/common.py", line 363, in 
> apache_beam.runners.common.DoFnInvoker.invoke_start_bundle
>  self.signature.start_bundle_method.method_value())
>  File 
> "/home/zdenulo/dev/gcp_stuff/df_firestore_stream/df_firestore_stream.py", 
> line 39, in start_bundle
> NameError: global name 'firestore' is not defined [while running 
> 'generatedPtransform-64']
>  
> {code}
> It's interesting that using Beam version 2.12.0 solves the problem on 
> Dataflow, it works as expected, not sure what could be the problem.
> Here is a repository with the code which was used 
> [https://github.com/zdenulo/dataflow_firestore_stream]
>  
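A common cause of the `NameError: global name 'firestore' is not defined` in the traceback above is that a module imported at the top of the launcher script is not re-imported in the Dataflow worker process. Two usual workarounds: pass `--save_main_session` as a pipeline option, or import the module inside the DoFn. A hedged sketch of the second approach (not the reporter's actual code; `FirestoreWriteFn` and the collection name are hypothetical):

```python
# Hedged sketch: FirestoreWriteFn and the "events" collection are
# hypothetical names for illustration, not the reporter's pipeline.
class FirestoreWriteFn(object):  # a real pipeline would subclass apache_beam.DoFn
    def start_bundle(self):
        # Import here so the worker process resolves the module itself.
        # Module-level imports from the launcher are not available on
        # workers unless --save_main_session is set.
        from google.cloud import firestore  # assumes google-cloud-firestore is installed
        self._client = firestore.Client()

    def process(self, element):
        # Hypothetical write; Firestore in native mode via the Firebase API.
        self._client.collection("events").add(element)
```

Why this works on 2.12.0 but not 2.13.0 is unclear; differences in how the SDK harness pickles the main session between releases are a plausible suspect.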



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-5164) ParquetIOIT fails on Spark and Flink

2019-08-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-5164?focusedWorklogId=294827=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-294827
 ]

ASF GitHub Bot logged work on BEAM-5164:


Author: ASF GitHub Bot
Created on: 14/Aug/19 16:18
Start Date: 14/Aug/19 16:18
Worklog Time Spent: 10m 
  Work Description: RyanSkraba commented on pull request #9339: 
[BEAM-5164]: Add documentation for ParquetIO.
URL: https://github.com/apache/beam/pull/9339#discussion_r313963888
 
 

 ##
 File path: website/src/documentation/io/built-in-parquet.md
 ##
 @@ -0,0 +1,141 @@
+---
+layout: section
+title: "Apache Parquet I/O connector"
+section_menu: section-menu/documentation.html
+permalink: /documentation/io/built-in/parquet/
+---
+
+
+[Built-in I/O Transforms]({{site.baseurl}}/documentation/io/built-in/)
+
+# Apache Parquet I/O connector
+
+<nav class="language-switcher">
+  <strong>Adapt for:</strong>
+  <ul>
+    <li data-type="language-java" class="active">Java SDK</li>
+    <li data-type="language-py">Python SDK</li>
+  </ul>
+</nav>
+
+The Beam SDKs include built-in transforms that can read data from and write 
data
+to [Apache Parquet](https://parquet.apache.org) files.
+
+## Before you start
+
+
+
+{:.language-java}
+To use ParquetIO, add the Maven artifact dependency to your `pom.xml` file.
+
+```java
+<dependency>
+    <groupId>org.apache.beam</groupId>
+    <artifactId>beam-sdks-java-io-parquet</artifactId>
+    <version>{{ site.release_latest }}</version>
+</dependency>
+```
+
+{:.language-java}
+Additional resources:
+
+{:.language-java}
+* [ParquetIO source 
code](https://github.com/apache/beam/blob/master/sdks/java/io/parquet/src/main/java/org/apache/beam/sdk/io/parquet/ParquetIO.java)
+* [ParquetIO Javadoc](https://beam.apache.org/releases/javadoc/{{ 
site.release_latest }}/org/apache/beam/sdk/io/parquet/ParquetIO.html)
+
+
+
+{:.language-py}
+ParquetIO comes preinstalled with the Apache Beam Python SDK.
+
+{:.language-py}
+Additional resources:
+
+{:.language-py}
+* [ParquetIO source 
code](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/parquetio.py)
+* [ParquetIO Pydoc](https://beam.apache.org/releases/pydoc/{{ 
site.release_latest }}/apache_beam.io.parquetio.html)
+
+
+### Using ParquetIO with Spark before 2.4
 
 Review comment:
   Neat!  I didn't know about that site -- I'll take a careful look.
   
   I copied the structure from the hcatalog which has the same issue...  I'll 
fix this one (later this week unfortunately) and apply it to hcatalog in the 
future.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 294827)
Time Spent: 40m  (was: 0.5h)

> ParquetIOIT fails on Spark and Flink
> 
>
> Key: BEAM-5164
> URL: https://issues.apache.org/jira/browse/BEAM-5164
> Project: Beam
>  Issue Type: Bug
>  Components: testing
>Reporter: Lukasz Gajowy
>Priority: Minor
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> When run on Spark or Flink remote cluster, ParquetIOIT fails with the 
> following stacktrace: 
> {code:java}
> org.apache.beam.sdk.io.parquet.ParquetIOIT > writeThenReadAll FAILED
> org.apache.beam.sdk.Pipeline$PipelineExecutionException: 
> java.lang.NoSuchMethodError: 
> org.apache.parquet.hadoop.ParquetWriter$Builder.(Lorg/apache/parquet/io/OutputFile;)V
> at 
> org.apache.beam.runners.spark.SparkPipelineResult.beamExceptionFrom(SparkPipelineResult.java:66)
> at 
> org.apache.beam.runners.spark.SparkPipelineResult.waitUntilFinish(SparkPipelineResult.java:99)
> at 
> org.apache.beam.runners.spark.SparkPipelineResult.waitUntilFinish(SparkPipelineResult.java:87)
> at org.apache.beam.runners.spark.TestSparkRunner.run(TestSparkRunner.java:116)
> at org.apache.beam.runners.spark.TestSparkRunner.run(TestSparkRunner.java:61)
> at org.apache.beam.sdk.Pipeline.run(Pipeline.java:313)
> at org.apache.beam.sdk.testing.TestPipeline.run(TestPipeline.java:350)
> at org.apache.beam.sdk.testing.TestPipeline.run(TestPipeline.java:331)
> at 
> org.apache.beam.sdk.io.parquet.ParquetIOIT.writeThenReadAll(ParquetIOIT.java:133)
> Caused by:
> java.lang.NoSuchMethodError: 
> org.apache.parquet.hadoop.ParquetWriter$Builder.(Lorg/apache/parquet/io/OutputFile;)V{code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7520) DirectRunner timers are not strictly time ordered

2019-08-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7520?focusedWorklogId=294824=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-294824
 ]

ASF GitHub Bot logged work on BEAM-7520:


Author: ASF GitHub Bot
Created on: 14/Aug/19 16:07
Start Date: 14/Aug/19 16:07
Worklog Time Spent: 10m 
  Work Description: je-ik commented on issue #9190: [BEAM-7520] Fix timer 
firing order in DirectRunner
URL: https://github.com/apache/beam/pull/9190#issuecomment-521311591
 
 
   @kennknowles would you have time to look into this, please?
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 294824)
Time Spent: 3h 50m  (was: 3h 40m)

> DirectRunner timers are not strictly time ordered
> -
>
> Key: BEAM-7520
> URL: https://issues.apache.org/jira/browse/BEAM-7520
> Project: Beam
>  Issue Type: Bug
>  Components: runner-direct
>Affects Versions: 2.13.0
>Reporter: Jan Lukavský
>Assignee: Jan Lukavský
>Priority: Major
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> Let's suppose we have the following situation:
>  - stateful ParDo with two timers - timerA and timerB
>  - timerA is set for window.maxTimestamp() + 1
>  - timerB is set anywhere between  timerB.timestamp
>  - input watermark moves to BoundedWindow.TIMESTAMP_MAX_VALUE
> Then the order of timers is as follows (correct):
>  - timerB
>  - timerA
> But, if timerB sets another timer (say for timerB.timestamp + 1), then the 
> order of timers will be:
>  - timerB (timerB.timestamp)
>  - timerA (BoundedWindow.TIMESTAMP_MAX_VALUE)
>  - timerB (timerB.timestamp + 1)
> Which is not ordered by timestamp. The reason for this is that when the input 
> watermark update is evaluated, the WatermarkManager.extractFiredTimers() will 
> produce both timerA and timerB. That would be correct, but when timerB sets 
> another timer, that breaks this.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7386) Add Utility BiTemporalStreamJoin

2019-08-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7386?focusedWorklogId=294816=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-294816
 ]

ASF GitHub Bot logged work on BEAM-7386:


Author: ASF GitHub Bot
Created on: 14/Aug/19 15:54
Start Date: 14/Aug/19 15:54
Worklog Time Spent: 10m 
  Work Description: je-ik commented on pull request #9032: [BEAM-7386] 
Bi-Temporal Join
URL: https://github.com/apache/beam/pull/9032#discussion_r313952183
 
 

 ##
 File path: 
sdks/java/extensions/timeseries/src/main/java/org/apache/beam/sdk/extensions/timeseries/joins/BiTemporalStreams.java
 ##
 @@ -0,0 +1,464 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.beam.sdk.extensions.timeseries.joins;
+
+import java.util.*;
+import org.apache.beam.sdk.annotations.Experimental;
+import org.apache.beam.sdk.coders.Coder;
+import org.apache.beam.sdk.coders.KvCoder;
+import org.apache.beam.sdk.coders.VarLongCoder;
+import org.apache.beam.sdk.state.*;
+import org.apache.beam.sdk.state.Timer;
+import org.apache.beam.sdk.transforms.DoFn;
+import org.apache.beam.sdk.transforms.Flatten;
+import org.apache.beam.sdk.transforms.PTransform;
+import org.apache.beam.sdk.transforms.ParDo;
+import org.apache.beam.sdk.transforms.join.KeyedPCollectionTuple;
+import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
+import org.apache.beam.sdk.transforms.windowing.FixedWindows;
+import org.apache.beam.sdk.transforms.windowing.PaneInfo;
+import org.apache.beam.sdk.transforms.windowing.Window;
+import org.apache.beam.sdk.util.WindowedValue;
+import org.apache.beam.sdk.values.*;
+import org.joda.time.Duration;
+import org.joda.time.Instant;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * This sample takes two streams, a left stream and a right stream, and matches
+ * the left stream to the nearest right stream based on their timestamps. The
+ * nearest left-stream value is the one whose timestamp is <= the right-stream
+ * timestamp.
+ *
+ * If two values in the right stream have the same timestamp then the results are
+ * non-deterministic.
+ */
+@Experimental
+public class BiTemporalStreams {
+
+  public static  BiTemporalJoin join(
+  TupleTag leftTag, TupleTag rightTag, Duration window) {
+return new BiTemporalJoin<>(leftTag, rightTag, window);
+  }
+
+  @Experimental
+  public static class BiTemporalJoin
+  extends PTransform, 
PCollection>> {
+
+private static final Logger LOG = 
LoggerFactory.getLogger(BiTemporalStreams.class);
+
+// Sets the limit at which point the processed left stream values are 
garbage collected
+static int LEFT_STREAM_GC_LIMIT = 1000;
+
+Coder leftCoder;
+Coder rightCoder;
+
+TupleTag leftTag;
+TupleTag rightTag;
+
+Duration window;
+
+private BiTemporalJoin(TupleTag leftTag, TupleTag rightTag, 
Duration window) {
+  this.leftTag = leftTag;
+  this.rightTag = rightTag;
+  this.window = window;
+}
+
+public static  BiTemporalJoin create(
+TupleTag leftTag, TupleTag rightTag, Duration window) {
+  return new BiTemporalJoin(leftTag, rightTag, window);
+}
+
+public BiTemporalJoin setGCLimit(int gcLimit) {
+  LEFT_STREAM_GC_LIMIT = gcLimit;
+  return this;
+}
+
+@Override
+public PCollection> 
expand(KeyedPCollectionTuple input) {
+
+  List> collections =
+  input.getKeyedCollections();
+
+  PCollection> leftCollection = null;
+  PCollection> rightCollection = null;
+
+  for (KeyedPCollectionTuple.TaggedKeyedPCollection t : collections) 
{
+
+if (t.getTupleTag().equals(leftTag)) {
+  leftCollection =
+  ((KeyedPCollectionTuple.TaggedKeyedPCollection) 
t).getCollection();
+}
+
+if (t.getTupleTag().equals(rightTag)) {
+  rightCollection =
+  ((KeyedPCollectionTuple.TaggedKeyedPCollection) 
t).getCollection();
+}
+  }
+
+  leftCoder = ((KvCoder) leftCollection.getCoder()).getValueCoder();
+  rightCoder = ((KvCoder) 
rightCollection.getCoder()).getValueCoder();
+
+  BiTemporalJoinResultCoder 
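(The quoted file is truncated by the archive here.) One plausible reading of the Javadoc's matching rule, pairing each element of one stream with the latest element of the other whose timestamp is not later, can be sketched outside Beam with `bisect` (an assumed simplification; the actual transform implements this with keyed state and timers):

```python
import bisect

def asof_match(left, right):
    """Pair each left (ts, value) with the latest right value whose
    timestamp is <= the left timestamp, or None if no such value exists."""
    right = sorted(right)                      # order right stream by timestamp
    right_ts = [ts for ts, _ in right]
    out = []
    for ts, value in sorted(left):
        i = bisect.bisect_right(right_ts, ts)  # count of rights with ts' <= ts
        match = right[i - 1][1] if i else None
        out.append((ts, value, match))
    return out

print(asof_match([(5, "L1"), (1, "L0")], [(4, "R1"), (2, "R0")]))
# -> [(1, 'L0', None), (5, 'L1', 'R1')]
```

As the Javadoc notes, when two right-stream values share a timestamp the choice between them is non-deterministic; this sketch simply takes whichever sorts last.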

[jira] [Work logged] (BEAM-7386) Add Utility BiTemporalStreamJoin

2019-08-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7386?focusedWorklogId=294807=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-294807
 ]

ASF GitHub Bot logged work on BEAM-7386:


Author: ASF GitHub Bot
Created on: 14/Aug/19 15:28
Start Date: 14/Aug/19 15:28
Worklog Time Spent: 10m 
  Work Description: rezarokni commented on pull request #9032: [BEAM-7386] 
Bi-Temporal Join
URL: https://github.com/apache/beam/pull/9032#discussion_r313938990
 
 

 ##
 File path: 
sdks/java/extensions/timeseries/src/main/java/org/apache/beam/sdk/extensions/timeseries/joins/BiTemporalStreams.java
 ##
 @@ -0,0 +1,464 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.beam.sdk.extensions.timeseries.joins;
+
+import java.util.*;
+import org.apache.beam.sdk.annotations.Experimental;
+import org.apache.beam.sdk.coders.Coder;
+import org.apache.beam.sdk.coders.KvCoder;
+import org.apache.beam.sdk.coders.VarLongCoder;
+import org.apache.beam.sdk.state.*;
+import org.apache.beam.sdk.state.Timer;
+import org.apache.beam.sdk.transforms.DoFn;
+import org.apache.beam.sdk.transforms.Flatten;
+import org.apache.beam.sdk.transforms.PTransform;
+import org.apache.beam.sdk.transforms.ParDo;
+import org.apache.beam.sdk.transforms.join.KeyedPCollectionTuple;
+import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
+import org.apache.beam.sdk.transforms.windowing.FixedWindows;
+import org.apache.beam.sdk.transforms.windowing.PaneInfo;
+import org.apache.beam.sdk.transforms.windowing.Window;
+import org.apache.beam.sdk.util.WindowedValue;
+import org.apache.beam.sdk.values.*;
+import org.joda.time.Duration;
+import org.joda.time.Instant;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * This sample takes two streams, a left stream and a right stream, and matches
+ * the left stream to the nearest right stream based on their timestamps. The
+ * nearest left-stream value is the one whose timestamp is <= the right-stream
+ * timestamp.
+ *
+ * If two values in the right stream have the same timestamp then the results are
+ * non-deterministic.
+ */
+@Experimental
+public class BiTemporalStreams {
+
+  public static  BiTemporalJoin join(
+  TupleTag leftTag, TupleTag rightTag, Duration window) {
+return new BiTemporalJoin<>(leftTag, rightTag, window);
+  }
+
+  @Experimental
+  public static class BiTemporalJoin
+  extends PTransform, 
PCollection>> {
+
+private static final Logger LOG = 
LoggerFactory.getLogger(BiTemporalStreams.class);
+
+// Sets the limit at which point the processed left stream values are 
garbage collected
+static int LEFT_STREAM_GC_LIMIT = 1000;
+
+Coder leftCoder;
+Coder rightCoder;
+
+TupleTag leftTag;
+TupleTag rightTag;
+
+Duration window;
+
+private BiTemporalJoin(TupleTag leftTag, TupleTag rightTag, 
Duration window) {
+  this.leftTag = leftTag;
+  this.rightTag = rightTag;
+  this.window = window;
+}
+
+public static  BiTemporalJoin create(
+TupleTag leftTag, TupleTag rightTag, Duration window) {
+  return new BiTemporalJoin(leftTag, rightTag, window);
+}
+
+public BiTemporalJoin setGCLimit(int gcLimit) {
+  LEFT_STREAM_GC_LIMIT = gcLimit;
+  return this;
+}
+
+@Override
+public PCollection> 
expand(KeyedPCollectionTuple input) {
+
+  List> collections =
+  input.getKeyedCollections();
+
+  PCollection> leftCollection = null;
+  PCollection> rightCollection = null;
+
+  for (KeyedPCollectionTuple.TaggedKeyedPCollection t : collections) 
{
+
+if (t.getTupleTag().equals(leftTag)) {
+  leftCollection =
+  ((KeyedPCollectionTuple.TaggedKeyedPCollection) 
t).getCollection();
+}
+
+if (t.getTupleTag().equals(rightTag)) {
+  rightCollection =
+  ((KeyedPCollectionTuple.TaggedKeyedPCollection) 
t).getCollection();
+}
+  }
+
+  leftCoder = ((KvCoder) leftCollection.getCoder()).getValueCoder();
+  rightCoder = ((KvCoder) 
rightCollection.getCoder()).getValueCoder();
+
+  BiTemporalJoinResultCoder 
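The class comment above describes the core matching rule: each left-stream value pairs with the right-stream value having the largest timestamp at or before its own. Outside Beam, that rule is a floor lookup over the sorted right stream. A minimal plain-Python sketch of just the matching semantics (the function and variable names are illustrative, not part of the Beam API):

```python
import bisect

def match_bitemporal(left_events, right_events):
    """For each (timestamp, value) in left_events, pair it with the value of
    the right_events entry having the largest timestamp <= that timestamp,
    or None when no such entry exists."""
    right_sorted = sorted(right_events)  # order right stream by timestamp
    right_ts = [ts for ts, _ in right_sorted]
    matches = []
    for ts, value in sorted(left_events):
        # Index of the last right-stream timestamp <= the left timestamp.
        i = bisect.bisect_right(right_ts, ts) - 1
        # On tied right timestamps this keeps an arbitrary one, mirroring the
        # non-determinism called out in the class comment.
        matches.append((value, right_sorted[i][1] if i >= 0 else None))
    return matches

# The left event at t=5 pairs with the right event at t=4, not the later one at t=6.
print(match_bitemporal([(5, "L1"), (7, "L2")], [(4, "R1"), (6, "R2")]))
# → [('L1', 'R1'), ('L2', 'R2')]
```

A production version additionally has to cope with out-of-order arrival and with garbage collection of already-matched values, which is what the timer/state machinery and `LEFT_STREAM_GC_LIMIT` in the transform above are for.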

[jira] [Updated] (BEAM-7979) Avro incompatibilities with Spark 2.2 and Spark 2.3

2019-08-14 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/BEAM-7979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ismaël Mejía updated BEAM-7979:
---
Status: Open  (was: Triage Needed)

> Avro incompatibilities with Spark 2.2 and Spark 2.3
> ---
>
> Key: BEAM-7979
> URL: https://issues.apache.org/jira/browse/BEAM-7979
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp, io-java-parquet, sdk-java-core
>Reporter: Ryan Skraba
>Priority: Major
>
> Much of the code that depends on Avro (notably the wrappers built with 
> [BeamSQL|https://github.com/apache/beam/blob/ae83448597f64474c3f5754d7b8e3f6b02347a6b/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/utils/AvroUtils.java#L34]
>  but also 
> [some|https://github.com/apache/beam/blob/ae83448597f64474c3f5754d7b8e3f6b02347a6b/sdks/java/io/parquet/src/main/java/org/apache/beam/sdk/io/parquet/ParquetIO.java]
>  
> [connectors|https://github.com/apache/beam/blob/ae83448597f64474c3f5754d7b8e3f6b02347a6b/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryAvroUtils.java#L42])
>  require an Avro version > 1.8.x.
> That version of Avro is not present on Spark 2.2 and Spark 2.3 clusters, which are 
> meant to be supported. Affected pipelines will fail with ClassNotFoundException 
> or NoSuchMethodError.
> Spark 2.4+ should be unaffected.
> Relocating or vendoring is probably not appropriate, since Avro is frequently 
> exposed in the API through parameters and potentially in generated specific 
> records.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (BEAM-7979) Avro incompatibilities with Spark 2.2 and Spark 2.3

2019-08-14 Thread Ryan Skraba (JIRA)
Ryan Skraba created BEAM-7979:
-

 Summary: Avro incompatibilities with Spark 2.2 and Spark 2.3
 Key: BEAM-7979
 URL: https://issues.apache.org/jira/browse/BEAM-7979
 Project: Beam
  Issue Type: Bug
  Components: io-java-gcp, io-java-parquet, sdk-java-core
Reporter: Ryan Skraba


Much of the code that depends on Avro (notably the wrappers built with 
[BeamSQL|https://github.com/apache/beam/blob/ae83448597f64474c3f5754d7b8e3f6b02347a6b/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/utils/AvroUtils.java#L34]
 but also 
[some|https://github.com/apache/beam/blob/ae83448597f64474c3f5754d7b8e3f6b02347a6b/sdks/java/io/parquet/src/main/java/org/apache/beam/sdk/io/parquet/ParquetIO.java]
 
[connectors|https://github.com/apache/beam/blob/ae83448597f64474c3f5754d7b8e3f6b02347a6b/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryAvroUtils.java#L42])
 require an Avro version > 1.8.x.

That version of Avro is not present on Spark 2.2 and Spark 2.3 clusters, which are 
meant to be supported. Affected pipelines will fail with ClassNotFoundException or 
NoSuchMethodError.

Spark 2.4+ should be unaffected.

Relocating or vendoring is probably not appropriate, since Avro is frequently 
exposed in the API through parameters and potentially in generated specific 
records.





--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (BEAM-5164) ParquetIOIT fails on Spark and Flink

2019-08-14 Thread Lukasz Gajowy (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-5164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907334#comment-16907334
 ] 

Lukasz Gajowy commented on BEAM-5164:
-

I'm fine with adding the docs as you suggested. I left comments in your PR.

> ParquetIOIT fails on Spark and Flink
> 
>
> Key: BEAM-5164
> URL: https://issues.apache.org/jira/browse/BEAM-5164
> Project: Beam
>  Issue Type: Bug
>  Components: testing
>Reporter: Lukasz Gajowy
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When run on Spark or Flink remote cluster, ParquetIOIT fails with the 
> following stacktrace: 
> {code:java}
> org.apache.beam.sdk.io.parquet.ParquetIOIT > writeThenReadAll FAILED
> org.apache.beam.sdk.Pipeline$PipelineExecutionException: 
> java.lang.NoSuchMethodError: 
> org.apache.parquet.hadoop.ParquetWriter$Builder.(Lorg/apache/parquet/io/OutputFile;)V
> at 
> org.apache.beam.runners.spark.SparkPipelineResult.beamExceptionFrom(SparkPipelineResult.java:66)
> at 
> org.apache.beam.runners.spark.SparkPipelineResult.waitUntilFinish(SparkPipelineResult.java:99)
> at 
> org.apache.beam.runners.spark.SparkPipelineResult.waitUntilFinish(SparkPipelineResult.java:87)
> at org.apache.beam.runners.spark.TestSparkRunner.run(TestSparkRunner.java:116)
> at org.apache.beam.runners.spark.TestSparkRunner.run(TestSparkRunner.java:61)
> at org.apache.beam.sdk.Pipeline.run(Pipeline.java:313)
> at org.apache.beam.sdk.testing.TestPipeline.run(TestPipeline.java:350)
> at org.apache.beam.sdk.testing.TestPipeline.run(TestPipeline.java:331)
> at 
> org.apache.beam.sdk.io.parquet.ParquetIOIT.writeThenReadAll(ParquetIOIT.java:133)
> Caused by:
> java.lang.NoSuchMethodError: 
> org.apache.parquet.hadoop.ParquetWriter$Builder.(Lorg/apache/parquet/io/OutputFile;)V{code}
>  
>  
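The NoSuchMethodError in the trace above only surfaces at run time, when the missing `ParquetWriter$Builder(OutputFile)` constructor is first invoked mid-pipeline. The general defensive pattern is to probe a dependency for the API you need before doing any work. A hedged sketch in Python, where `has_api` is an illustrative helper (not part of Beam or Parquet) and `csv` merely stands in for an arbitrary dependency:

```python
import csv  # stand-in for any third-party dependency

def has_api(obj, name):
    """True when the dependency exposes a callable attribute `name`."""
    return callable(getattr(obj, name, None))

assert has_api(csv, "reader")              # API present: safe to proceed
assert not has_api(csv, "parquet_writer")  # missing API caught up front
```

The Java analogue would be a reflective lookup of the required constructor during pipeline construction, so a cluster shipping an older library fails fast with a clear message instead of deep inside a worker.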



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-5164) ParquetIOIT fails on Spark and Flink

2019-08-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-5164?focusedWorklogId=294800&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-294800
 ]

ASF GitHub Bot logged work on BEAM-5164:


Author: ASF GitHub Bot
Created on: 14/Aug/19 15:04
Start Date: 14/Aug/19 15:04
Worklog Time Spent: 10m 
  Work Description: lgajowy commented on pull request #9339: [BEAM-5164]: 
Add documentation for ParquetIO.
URL: https://github.com/apache/beam/pull/9339#discussion_r313920365
 
 

 ##
 File path: website/src/documentation/io/built-in-parquet.md
 ##
 @@ -0,0 +1,141 @@
+---
+layout: section
+title: "Apache Parquet I/O connector"
+section_menu: section-menu/documentation.html
+permalink: /documentation/io/built-in/parquet/
+---
+
+
+[Built-in I/O Transforms]({{site.baseurl}}/documentation/io/built-in/)
+
+# Apache Parquet I/O connector
+
+
+  Adapt for:
+  
+Java SDK
+Python SDK
+  
+
+
+The Beam SDKs include built-in transforms that can read data from and write 
data
+to [Apache Parquet](https://parquet.apache.org) files.
+
+## Before you start
+
+
+
+{:.language-java}
+To use ParquetIO, add the Maven artifact dependency to your `pom.xml` file.
+
+```java
+<dependency>
+    <groupId>org.apache.beam</groupId>
+    <artifactId>beam-sdks-java-io-parquet</artifactId>
+    <version>{{ site.release_latest }}</version>
+</dependency>
+```
+
+{:.language-java}
+Additional resources:
+
+{:.language-java}
+* [ParquetIO source 
code](https://github.com/apache/beam/blob/master/sdks/java/io/parquet/src/main/java/org/apache/beam/sdk/io/parquet/ParquetIO.java)
+* [ParquetIO Javadoc](https://beam.apache.org/releases/javadoc/{{ 
site.release_latest }}/org/apache/beam/sdk/io/parquet/ParquetIO.html)
+
+
+
+{:.language-py}
+ParquetIO comes preinstalled with the Apache Beam Python SDK.
+
+{:.language-py}
+Additional resources:
+
+{:.language-py}
+* [ParquetIO source 
code](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/parquetio.py)
+* [ParquetIO Pydoc](https://beam.apache.org/releases/pydoc/{{ 
site.release_latest }}/apache_beam.io.parquetio.html)
+
+
+### Using ParquetIO with Spark before 2.4
 
 Review comment:
   (screenshot: https://user-images.githubusercontent.com/1932045/63031334-26736c00-beb4-11e9-9e12-941c9d1a4daf.png)
   
   I mean this switch.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 294800)
Time Spent: 0.5h  (was: 20m)

> ParquetIOIT fails on Spark and Flink
> 
>
> Key: BEAM-5164
> URL: https://issues.apache.org/jira/browse/BEAM-5164
> Project: Beam
>  Issue Type: Bug
>  Components: testing
>Reporter: Lukasz Gajowy
>Priority: Minor
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When run on Spark or Flink remote cluster, ParquetIOIT fails with the 
> following stacktrace: 
> {code:java}
> org.apache.beam.sdk.io.parquet.ParquetIOIT > writeThenReadAll FAILED
> org.apache.beam.sdk.Pipeline$PipelineExecutionException: 
> java.lang.NoSuchMethodError: 
> org.apache.parquet.hadoop.ParquetWriter$Builder.(Lorg/apache/parquet/io/OutputFile;)V
> at 
> org.apache.beam.runners.spark.SparkPipelineResult.beamExceptionFrom(SparkPipelineResult.java:66)
> at 
> org.apache.beam.runners.spark.SparkPipelineResult.waitUntilFinish(SparkPipelineResult.java:99)
> at 
> org.apache.beam.runners.spark.SparkPipelineResult.waitUntilFinish(SparkPipelineResult.java:87)
> at org.apache.beam.runners.spark.TestSparkRunner.run(TestSparkRunner.java:116)
> at org.apache.beam.runners.spark.TestSparkRunner.run(TestSparkRunner.java:61)
> at org.apache.beam.sdk.Pipeline.run(Pipeline.java:313)
> at org.apache.beam.sdk.testing.TestPipeline.run(TestPipeline.java:350)
> at org.apache.beam.sdk.testing.TestPipeline.run(TestPipeline.java:331)
> at 
> org.apache.beam.sdk.io.parquet.ParquetIOIT.writeThenReadAll(ParquetIOIT.java:133)
> Caused by:
> java.lang.NoSuchMethodError: 
> org.apache.parquet.hadoop.ParquetWriter$Builder.(Lorg/apache/parquet/io/OutputFile;)V{code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-5164) ParquetIOIT fails on Spark and Flink

2019-08-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-5164?focusedWorklogId=294799&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-294799
 ]

ASF GitHub Bot logged work on BEAM-5164:


Author: ASF GitHub Bot
Created on: 14/Aug/19 15:04
Start Date: 14/Aug/19 15:04
Worklog Time Spent: 10m 
  Work Description: lgajowy commented on pull request #9339: [BEAM-5164]: 
Add documentation for ParquetIO.
URL: https://github.com/apache/beam/pull/9339#discussion_r313920065
 
 

 ##
 File path: website/src/documentation/io/built-in-parquet.md
 ##
 @@ -0,0 +1,141 @@
+---
+layout: section
+title: "Apache Parquet I/O connector"
+section_menu: section-menu/documentation.html
+permalink: /documentation/io/built-in/parquet/
+---
+
+
+[Built-in I/O Transforms]({{site.baseurl}}/documentation/io/built-in/)
+
+# Apache Parquet I/O connector
+
+
+  Adapt for:
+  
+Java SDK
+Python SDK
+  
+
+
+The Beam SDKs include built-in transforms that can read data from and write 
data
+to [Apache Parquet](https://parquet.apache.org) files.
+
+## Before you start
+
+
+
+{:.language-java}
+To use ParquetIO, add the Maven artifact dependency to your `pom.xml` file.
+
+```java
+<dependency>
+    <groupId>org.apache.beam</groupId>
+    <artifactId>beam-sdks-java-io-parquet</artifactId>
+    <version>{{ site.release_latest }}</version>
+</dependency>
+```
+
+{:.language-java}
+Additional resources:
+
+{:.language-java}
+* [ParquetIO source 
code](https://github.com/apache/beam/blob/master/sdks/java/io/parquet/src/main/java/org/apache/beam/sdk/io/parquet/ParquetIO.java)
+* [ParquetIO Javadoc](https://beam.apache.org/releases/javadoc/{{ 
site.release_latest }}/org/apache/beam/sdk/io/parquet/ParquetIO.html)
+
+
+
+{:.language-py}
+ParquetIO comes preinstalled with the Apache Beam Python SDK.
+
+{:.language-py}
+Additional resources:
+
+{:.language-py}
+* [ParquetIO source 
code](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/parquetio.py)
+* [ParquetIO Pydoc](https://beam.apache.org/releases/pydoc/{{ 
site.release_latest }}/apache_beam.io.parquetio.html)
+
+
+### Using ParquetIO with Spark before 2.4
 
 Review comment:
   Good idea to document this.
   
   I took a look at the stage page: 
http://apache-beam-website-pull-requests.storage.googleapis.com/9339/documentation/io/built-in/parquet/index.html
   
   when I choose "adapt for Python SDK" I still see instructions for maven. It 
is not intentional, right?
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 294799)
Time Spent: 20m  (was: 10m)

> ParquetIOIT fails on Spark and Flink
> 
>
> Key: BEAM-5164
> URL: https://issues.apache.org/jira/browse/BEAM-5164
> Project: Beam
>  Issue Type: Bug
>  Components: testing
>Reporter: Lukasz Gajowy
>Priority: Minor
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> When run on Spark or Flink remote cluster, ParquetIOIT fails with the 
> following stacktrace: 
> {code:java}
> org.apache.beam.sdk.io.parquet.ParquetIOIT > writeThenReadAll FAILED
> org.apache.beam.sdk.Pipeline$PipelineExecutionException: 
> java.lang.NoSuchMethodError: 
> org.apache.parquet.hadoop.ParquetWriter$Builder.(Lorg/apache/parquet/io/OutputFile;)V
> at 
> org.apache.beam.runners.spark.SparkPipelineResult.beamExceptionFrom(SparkPipelineResult.java:66)
> at 
> org.apache.beam.runners.spark.SparkPipelineResult.waitUntilFinish(SparkPipelineResult.java:99)
> at 
> org.apache.beam.runners.spark.SparkPipelineResult.waitUntilFinish(SparkPipelineResult.java:87)
> at org.apache.beam.runners.spark.TestSparkRunner.run(TestSparkRunner.java:116)
> at org.apache.beam.runners.spark.TestSparkRunner.run(TestSparkRunner.java:61)
> at org.apache.beam.sdk.Pipeline.run(Pipeline.java:313)
> at org.apache.beam.sdk.testing.TestPipeline.run(TestPipeline.java:350)
> at org.apache.beam.sdk.testing.TestPipeline.run(TestPipeline.java:331)
> at 
> org.apache.beam.sdk.io.parquet.ParquetIOIT.writeThenReadAll(ParquetIOIT.java:133)
> Caused by:
> java.lang.NoSuchMethodError: 
> org.apache.parquet.hadoop.ParquetWriter$Builder.(Lorg/apache/parquet/io/OutputFile;)V{code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Work logged] (BEAM-7772) Stop using Perfkit Benchmarker tool in all tests

2019-08-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7772?focusedWorklogId=294791&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-294791
 ]

ASF GitHub Bot logged work on BEAM-7772:


Author: ASF GitHub Bot
Created on: 14/Aug/19 14:48
Start Date: 14/Aug/19 14:48
Worklog Time Spent: 10m 
  Work Description: kkucharc commented on pull request #9315: [BEAM-7772] 
remove pkb from JDBCIOIT, HadoopFormatIOIT and hdfs tests
URL: https://github.com/apache/beam/pull/9315#discussion_r313916640
 
 

 ##
 File path: .test-infra/jenkins/job_PerformanceTests_FileBasedIO_IT.groovy
 ##
 @@ -149,3 +246,59 @@ private void createFileBasedIOITTestJob(testJob) {
 }
   }
 }
+
+jobs.findAll {
+  it.name in [
+  'beam_PerformanceTests_TextIOIT_HDFS',
+  'beam_PerformanceTests_Compressed_TextIOIT_HDFS',
+  'beam_PerformanceTests_ManyFiles_TextIOIT_HDFS',
+  // TODO(BEAM-3945) TFRecord performance test is failing only when 
running on hdfs.
+  // We need to fix this before enabling this job on jenkins.
+  //'beam_PerformanceTests_TFRecordIOIT_HDFS',
+  'beam_PerformanceTests_AvroIOIT_HDFS',
+  'beam_PerformanceTests_XmlIOIT_HDFS',
+  'beam_PerformanceTests_ParquetIOIT_HDFS'
+  ]
+}.forEach { testJob -> createHDFSFileBasedIOITTestJob(testJob) }
+
+private void createHDFSFileBasedIOITTestJob(testJob) {
+  job(testJob.name) {
+description(testJob.description)
+common.setTopLevelMainJobProperties(delegate)
+common.enablePhraseTriggeringFromPullRequest(delegate, 
testJob.githubTitle, testJob.githubTriggerPhrase)
+common.setAutoJob(delegate, 'H */6 * * *')
+
+String namespace = common.getKubernetesNamespace(testJob.name)
+String kubeconfig = common.getKubeconfigLocationForNamespace(namespace)
+Kubernetes k8s = Kubernetes.create(delegate, kubeconfig, namespace)
+
+
k8s.apply(common.makePathAbsolute("src/.test-infra/kubernetes/hadoop/LargeITCluster/hdfs-multi-datanode-cluster.yml"))
+String hostName = "LOAD_BALANCER_IP"
+k8s.loadBalancerIP("hadoop", hostName)
+
+Map additionalOptions = [
+runner   : 'DataflowRunner',
+project  : 'apache-beam-testing',
+tempRoot : 'gs://temp-storage-for-perf-tests',
+hdfsConfiguration: 
/[{\\\"fs.defaultFS\\\":\\\"hdfs://$${hostName}:9000\\\",\\\"dfs.replication\\\":1}]/,
 
 Review comment:
   I am impressed by this escaping 
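For anyone decoding the triple-escaped `hdfsConfiguration` value above: once Jenkins, the shell, and Gradle have each consumed a layer of escaping, what the pipeline option ultimately receives is ordinary JSON. A sketch of producing that value with `json.dumps` instead of escaping by hand (the hostname is the same placeholder the Jenkins job later substitutes):

```python
import json

# One-entry JSON list, the shape the hdfsConfiguration pipeline option expects.
host = "LOAD_BALANCER_IP"  # placeholder resolved by the Jenkins job
hdfs_configuration = [{"fs.defaultFS": "hdfs://%s:9000" % host, "dfs.replication": 1}]

option_value = json.dumps(hdfs_configuration)
print(option_value)
# → [{"fs.defaultFS": "hdfs://LOAD_BALANCER_IP:9000", "dfs.replication": 1}]
```

Serializing the structure and then applying each transport's quoting mechanically is less error-prone than authoring the fully escaped literal directly, which is what makes the Groovy line above impressive.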
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 294791)
Time Spent: 3h 20m  (was: 3h 10m)

> Stop using Perfkit Benchmarker tool in all tests
> 
>
> Key: BEAM-7772
> URL: https://issues.apache.org/jira/browse/BEAM-7772
> Project: Beam
>  Issue Type: Improvement
>  Components: testing
>Reporter: Lukasz Gajowy
>Assignee: Lukasz Gajowy
>Priority: Major
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> [Devlist thread 
> link|https://lists.apache.org/thread.html/dab1c093799248787e8b75e63b66d7389b594b649a4d9a4a5db1cfbb@%3Cdev.beam.apache.org%3E]
>  
> Currently Python, IOIT and some Dataflow and Spark performance tests are 
> relying on Perfkit Benchmarker tool. Due to the reasons discussed on the 
> devlist it was decided to remove it from Beam's tests. 
> Problems that we face currently:
>  # Changes to Gradle tasks/build configuration in the Beam codebase have to 
> be reflected in Perfkit code. This required PRs to Perfkit which can last and 
> the tests break due to this sometimes (no change in Perfkit + change already 
> there in beam = incompatibility). This is what happened in PR 8919 (above),
>  # Can't run in Python3 (depends on python 2 only library like functools32),
>  # Black box testing which hard to collect pipeline related metrics,
>  # Measurement of run time is inaccurate,
>  # It offers relatively small elasticity in comparison with eg. Jenkins tasks 
> in terms of setting up the testing infrastructure (runners, databases). For 
> example, if we'd like to setup Flink runner, and reuse it in consequent tests 
> in one go, that would be impossible. We can easily do this in Jenkins.
> Tests that use Perfkit:
>  # IO integration tests,
>  # Python performance tests,
>  # beam_PerformanceTests_Dataflow (disabled),
>  # beam_PerformanceTests_Spark (failing constantly - looks not maintained).
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

