[jira] [Work logged] (BEAM-8344) Add infer schema support in ParquetIO and refactor ParquetTableProvider

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8344?focusedWorklogId=333070&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-333070
 ]

ASF GitHub Bot logged work on BEAM-8344:


Author: ASF GitHub Bot
Created on: 24/Oct/19 04:31
Start Date: 24/Oct/19 04:31
Worklog Time Spent: 10m 
  Work Description: bmv126 commented on issue #9721: [BEAM-8344] Add 
inferSchema support in ParquetIO and refactor ParquetTableProvider
URL: https://github.com/apache/beam/pull/9721#issuecomment-545739303
 
 
   R: @amaliujia R: @reuvenlax R: @jbonofre 
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 333070)
Time Spent: 1h  (was: 50m)

> Add infer schema support in ParquetIO and refactor ParquetTableProvider
> ---
>
> Key: BEAM-8344
> URL: https://issues.apache.org/jira/browse/BEAM-8344
> Project: Beam
>  Issue Type: Improvement
>  Components: dsl-sql, io-java-parquet
>Reporter: Vishwas
>Assignee: Vishwas
>Priority: Major
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Add support for inferring Beam Schema in ParquetIO.
> Refactor ParquetTable code to use Convert.rows().
> Remove unnecessary java class GenericRecordReadConverter.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-7746) Add type hints to python code

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7746?focusedWorklogId=333061&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-333061
 ]

ASF GitHub Bot logged work on BEAM-7746:


Author: ASF GitHub Bot
Created on: 24/Oct/19 04:14
Start Date: 24/Oct/19 04:14
Worklog Time Spent: 10m 
  Work Description: chadrik commented on issue #9056: [BEAM-7746] Add 
python type hints
URL: https://github.com/apache/beam/pull/9056#issuecomment-545735730
 
 
   Ok, I have what I think are satisfactory answers to all of the problems I 
encountered, except for the 4 issues below that I need input on. 
   
   @robertwb can you help me find the answers to these, please!
   
   ---
   
   ```
   apache_beam/io/iobase.py:924: error: "SourceBase" has no attribute "coder"  [attr-defined]
   ```
   
   I can't find any subclasses of `SourceBase` that have a `coder` attribute. 
Is this safe to remove, or is it a Dataflow-specific thing?
   
   ---
   
   ```
   apache_beam/runners/worker/statesampler_slow.py:77: error: "StateSampler" has no attribute "_states_by_name"  [attr-defined]
   ```
   
   `statesampler_slow.StateSampler` does not have a `_states_by_name` attribute, 
but `statesampler.StateSampler` does.  I could add this attribute to 
`statesampler_slow.StateSampler`, but I don't think it would be used. The more 
straightforward solution may be to change 
`statesampler_slow.StateSampler.reset()` to do nothing; right now, I think it 
would raise an error if it ran.
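
   The no-op `reset()` proposed above can be sketched in isolation. This is a 
standalone illustration, not the actual `statesampler_slow` module; the class 
name is reused only for context:

```python
class StateSampler(object):
  """Standalone sketch of the pure-Python (slow) sampler discussed above.

  Unlike the Cython-backed statesampler.StateSampler, this class keeps no
  _states_by_name mapping, so reset() has nothing to clear.
  """

  def reset(self):
    # Intentionally a no-op: there is no per-state bookkeeping to reset here.
    pass
```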
   
   ---
   ```
   apache_beam/runners/portability/fn_api_runner_transforms.py:280: error: Invalid index type "Optional[str]" for "MutableMapping[str, Environment]"; expected type "str"  [index]
   ```
   
   It's unclear to me whether `Stage.environment` is meant to be `str` or 
`Optional[str]`. The way it's initialized, it _could_ be `Optional[str]`:
   
```python
class Stage(object):
  """A set of Transforms that can be sent to the worker for processing."""
  def __init__(self,
               name,  # type: str
               transforms,  # type: List[beam_runner_api_pb2.PTransform]
               downstream_side_inputs=None,  # type: Optional[FrozenSet[str]]
               must_follow=frozenset(),  # type: FrozenSet[Stage]
               parent=None,  # type: Optional[Stage]
               environment=None,  # type: Optional[str]
               forced_root=False
              ):
    ...

    if environment is None:
      environment = functools.reduce(
          self._merge_environments,
          (self._extract_environment(t) for t in transforms))
    self.environment = environment

  ...

  @staticmethod
  def _extract_environment(transform):
    # type: (beam_runner_api_pb2.PTransform) -> Optional[str]
    if transform.spec.urn in PAR_DO_URNS:
      pardo_payload = proto_utils.parse_Bytes(
          transform.spec.payload, beam_runner_api_pb2.ParDoPayload)
      return pardo_payload.do_fn.environment_id
    elif transform.spec.urn in COMBINE_URNS:
      combine_payload = proto_utils.parse_Bytes(
          transform.spec.payload, beam_runner_api_pb2.CombinePayload)
      return combine_payload.combine_fn.environment_id
    else:
      return None
```
   
   In practice, will there always be at least one ParDo or Combine per 
stage? If so, we should be asserting that `self.environment is not None` in 
`Stage.__init__`.  
   
   Alternatively, we could assert that `self.environment is not None` just before 
this call in `executable_stage_transform`.  
   
   The bottom line is that there are currently no guarantees in the code that 
`self.environment` is not None at this point, and if it is None, this will be an 
error.
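
   The narrowing fix described above can be shown in isolation; the names 
`environments` and `lookup` below are illustrative only, not from the Beam 
codebase:

```python
from typing import Dict, Optional

environments: Dict[str, str] = {"env-1": "urn:beam:env:docker:v1"}

def lookup(env_id: Optional[str]) -> str:
  # Without this assert, mypy flags environments[env_id] with the same
  # 'Invalid index type "Optional[str]" ... expected type "str"' error;
  # the assert narrows Optional[str] to str for the index expression.
  assert env_id is not None, "stage has no environment"
  return environments[env_id]
```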
   
   ---
   
   ```
   apache_beam/runners/portability/fn_api_runner.py:933: error: "ParallelBundleManager" has no attribute "_skip_registration"  [attr-defined]
   ```
   
   I can't find anything in the code that refers to `_skip_registration`. Is 
it safe to remove?
   
 



Issue Time Tracking
---

Worklog Id: (was: 333061)
Time Spent: 9.5h  (was: 9h 20m)

> Add type hints to python code
> -
>
> Key: BEAM-7746
> URL: https://issues.apache.org/jira/browse/BEAM-7746
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-py-core
>Reporter: Chad Dombrova
>Assignee: Chad Dombrova
>Priority: Major
>  Time Spent: 9.5h
>  Remaining Estimate: 0h
>
> As a 

[jira] [Work logged] (BEAM-7926) Visualize PCollection with Interactive Beam

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7926?focusedWorklogId=333026&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-333026
 ]

ASF GitHub Bot logged work on BEAM-7926:


Author: ASF GitHub Bot
Created on: 24/Oct/19 03:24
Start Date: 24/Oct/19 03:24
Worklog Time Spent: 10m 
  Work Description: pabloem commented on pull request #9771: [BEAM-7926] 
Update dependencies in Java Katas
URL: https://github.com/apache/beam/pull/9771
 
 
   
 



Issue Time Tracking
---

Worklog Id: (was: 333026)
Time Spent: 14h 10m  (was: 14h)

> Visualize PCollection with Interactive Beam
> ---
>
> Key: BEAM-7926
> URL: https://issues.apache.org/jira/browse/BEAM-7926
> Project: Beam
>  Issue Type: New Feature
>  Components: runner-py-interactive
>Reporter: Ning Kang
>Assignee: Ning Kang
>Priority: Major
>  Time Spent: 14h 10m
>  Remaining Estimate: 0h
>
> Support auto plotting / charting of materialized data of a given PCollection 
> with Interactive Beam.
> Say an Interactive Beam pipeline is defined as
> p = create_pipeline()
> pcoll = p | 'Transform' >> transform()
> The user can call a single function and get auto-magical charting of the data 
> as materialized pcoll,
> e.g., visualize(pcoll)





[jira] [Work logged] (BEAM-7926) Visualize PCollection with Interactive Beam

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7926?focusedWorklogId=333027&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-333027
 ]

ASF GitHub Bot logged work on BEAM-7926:


Author: ASF GitHub Bot
Created on: 24/Oct/19 03:24
Start Date: 24/Oct/19 03:24
Worklog Time Spent: 10m 
  Work Description: pabloem commented on issue #9771: [BEAM-7926] Update 
dependencies in Java Katas
URL: https://github.com/apache/beam/pull/9771#issuecomment-545725673
 
 
   Thanks for the review @henryken 
   and thanks a lot for the PR @leonardoam !
 



Issue Time Tracking
---

Worklog Id: (was: 333027)
Time Spent: 14h 20m  (was: 14h 10m)

> Visualize PCollection with Interactive Beam
> ---
>
> Key: BEAM-7926
> URL: https://issues.apache.org/jira/browse/BEAM-7926
> Project: Beam
>  Issue Type: New Feature
>  Components: runner-py-interactive
>Reporter: Ning Kang
>Assignee: Ning Kang
>Priority: Major
>  Time Spent: 14h 20m
>  Remaining Estimate: 0h
>
> Support auto plotting / charting of materialized data of a given PCollection 
> with Interactive Beam.
> Say an Interactive Beam pipeline is defined as
> p = create_pipeline()
> pcoll = p | 'Transform' >> transform()
> The user can call a single function and get auto-magical charting of the data 
> as materialized pcoll,
> e.g., visualize(pcoll)





[jira] [Work logged] (BEAM-7926) Visualize PCollection with Interactive Beam

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7926?focusedWorklogId=333023&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-333023
 ]

ASF GitHub Bot logged work on BEAM-7926:


Author: ASF GitHub Bot
Created on: 24/Oct/19 03:19
Start Date: 24/Oct/19 03:19
Worklog Time Spent: 10m 
  Work Description: henryken commented on issue #9771: [BEAM-7926] Update 
dependencies in Java Katas
URL: https://github.com/apache/beam/pull/9771#issuecomment-545724583
 
 
   LGTM. I ran a quick test and nothing breaks.
 



Issue Time Tracking
---

Worklog Id: (was: 333023)
Time Spent: 14h  (was: 13h 50m)

> Visualize PCollection with Interactive Beam
> ---
>
> Key: BEAM-7926
> URL: https://issues.apache.org/jira/browse/BEAM-7926
> Project: Beam
>  Issue Type: New Feature
>  Components: runner-py-interactive
>Reporter: Ning Kang
>Assignee: Ning Kang
>Priority: Major
>  Time Spent: 14h
>  Remaining Estimate: 0h
>
> Support auto plotting / charting of materialized data of a given PCollection 
> with Interactive Beam.
> Say an Interactive Beam pipeline is defined as
> p = create_pipeline()
> pcoll = p | 'Transform' >> transform()
> The user can call a single function and get auto-magical charting of the data 
> as materialized pcoll,
> e.g., visualize(pcoll)





[jira] [Work logged] (BEAM-8446) apache_beam.io.gcp.bigquery_write_it_test.BigQueryWriteIntegrationTests.test_big_query_write_new_types is flaky

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8446?focusedWorklogId=333022&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-333022
 ]

ASF GitHub Bot logged work on BEAM-8446:


Author: ASF GitHub Bot
Created on: 24/Oct/19 03:16
Start Date: 24/Oct/19 03:16
Worklog Time Spent: 10m 
  Work Description: pabloem commented on issue #9855: [BEAM-8446] Retrying 
BQ query on timeouts
URL: https://github.com/apache/beam/pull/9855#issuecomment-545724097
 
 
   Hm, I added it because the pydoc specifies it as a possibility. I did not 
reproduce it myself because the job is flaky...
   
https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.job.QueryJob.html#google.cloud.bigquery.job.QueryJob.result
   
 



Issue Time Tracking
---

Worklog Id: (was: 333022)
Time Spent: 3h 40m  (was: 3.5h)

> apache_beam.io.gcp.bigquery_write_it_test.BigQueryWriteIntegrationTests.test_big_query_write_new_types
>  is flaky
> ---
>
> Key: BEAM-8446
> URL: https://issues.apache.org/jira/browse/BEAM-8446
> Project: Beam
>  Issue Type: New Feature
>  Components: test-failures
>Reporter: Boyuan Zhang
>Assignee: Pablo Estrada
>Priority: Major
>  Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> test_big_query_write_new_types appears to be flaky in 
> beam_PostCommit_Python37 test suite.
> https://builds.apache.org/job/beam_PostCommit_Python37/733/
> https://builds.apache.org/job/beam_PostCommit_Python37/739/





[jira] [Work logged] (BEAM-8388) Update Avro to 1.9.1 from 1.8.2

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8388?focusedWorklogId=333007&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-333007
 ]

ASF GitHub Bot logged work on BEAM-8388:


Author: ASF GitHub Bot
Created on: 24/Oct/19 02:33
Start Date: 24/Oct/19 02:33
Worklog Time Spent: 10m 
  Work Description: kennknowles commented on issue #9779: [BEAM-8388] 
Updating avro dependency from 1.8.2 to 1.9.1
URL: https://github.com/apache/beam/pull/9779#issuecomment-545715194
 
 
   Seems like including Avro in the core SDK might be a mistake / legacy 
problem. We could potentially create new artifacts for particular versions so 
users can choose compatible pieces of Beam.
 



Issue Time Tracking
---

Worklog Id: (was: 333007)
Remaining Estimate: 21h 20m  (was: 21.5h)
Time Spent: 2h 40m  (was: 2.5h)

> Update Avro to 1.9.1 from 1.8.2
> ---
>
> Key: BEAM-8388
> URL: https://issues.apache.org/jira/browse/BEAM-8388
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-avro
>Reporter: Jordanna Chord
>Assignee: Jordanna Chord
>Priority: Major
>   Original Estimate: 24h
>  Time Spent: 2h 40m
>  Remaining Estimate: 21h 20m
>
> Update build dependency to 1.9.1





[jira] [Work logged] (BEAM-8446) apache_beam.io.gcp.bigquery_write_it_test.BigQueryWriteIntegrationTests.test_big_query_write_new_types is flaky

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8446?focusedWorklogId=332997&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332997
 ]

ASF GitHub Bot logged work on BEAM-8446:


Author: ASF GitHub Bot
Created on: 24/Oct/19 01:26
Start Date: 24/Oct/19 01:26
Worklog Time Spent: 10m 
  Work Description: udim commented on issue #9855: [BEAM-8446] Retrying BQ 
query on timeouts
URL: https://github.com/apache/beam/pull/9855#issuecomment-545701377
 
 
   > Again timeout without failures: 
https://builds.apache.org/job/beam_PostCommit_Python37_PR/46/
   
   I've tried looking at these with no luck. :(
 



Issue Time Tracking
---

Worklog Id: (was: 332997)
Time Spent: 3.5h  (was: 3h 20m)

> apache_beam.io.gcp.bigquery_write_it_test.BigQueryWriteIntegrationTests.test_big_query_write_new_types
>  is flaky
> ---
>
> Key: BEAM-8446
> URL: https://issues.apache.org/jira/browse/BEAM-8446
> Project: Beam
>  Issue Type: New Feature
>  Components: test-failures
>Reporter: Boyuan Zhang
>Assignee: Pablo Estrada
>Priority: Major
>  Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> test_big_query_write_new_types appears to be flaky in 
> beam_PostCommit_Python37 test suite.
> https://builds.apache.org/job/beam_PostCommit_Python37/733/
> https://builds.apache.org/job/beam_PostCommit_Python37/739/





[jira] [Work logged] (BEAM-8446) apache_beam.io.gcp.bigquery_write_it_test.BigQueryWriteIntegrationTests.test_big_query_write_new_types is flaky

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8446?focusedWorklogId=332996&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332996
 ]

ASF GitHub Bot logged work on BEAM-8446:


Author: ASF GitHub Bot
Created on: 24/Oct/19 01:23
Start Date: 24/Oct/19 01:23
Worklog Time Spent: 10m 
  Work Description: udim commented on pull request #9855: [BEAM-8446] 
Retrying BQ query on timeouts
URL: https://github.com/apache/beam/pull/9855#discussion_r338345852
 
 

 ##
 File path: sdks/python/apache_beam/io/gcp/tests/bigquery_matcher.py
 ##
 @@ -46,9 +47,11 @@
 MAX_RETRIES = 5
 
 
-def retry_on_http_and_value_error(exception):
+def retry_on_http_timeout_and_value_error(exception):
   """Filter allowing retries on Bigquery errors and value error."""
-  return isinstance(exception, (GoogleCloudError, ValueError))
+  return isinstance(exception, (GoogleCloudError,
+ValueError,
+concurrent.futures.TimeoutError))
 
 Review comment:
   Where is this `TimeoutError` being raised? I didn't see it in the failed 
tests' logs.
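
For context, the retry predicate in the diff above can be exercised standalone. 
This is a dependency-free sketch: `GoogleCloudError` is left out because it 
requires the google-cloud library, and the plain loop below stands in for the 
backoff decorator the real matcher uses:

```python
import concurrent.futures

def retry_on_timeout_and_value_error(exception):
  # Mirrors the patched filter: retry on ValueError and on the
  # concurrent.futures.TimeoutError raised when future.result(timeout=...)
  # does not finish in time. (GoogleCloudError omitted in this sketch.)
  return isinstance(
      exception, (ValueError, concurrent.futures.TimeoutError))

def call_with_retries(fn, max_retries=5):
  # Plain retry loop standing in for the backoff decorator.
  last_exception = None
  for _ in range(max_retries):
    try:
      return fn()
    except Exception as e:
      if not retry_on_timeout_and_value_error(e):
        raise  # non-retryable errors propagate immediately
      last_exception = e
  raise last_exception
```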
 



Issue Time Tracking
---

Worklog Id: (was: 332996)
Time Spent: 3h 20m  (was: 3h 10m)

> apache_beam.io.gcp.bigquery_write_it_test.BigQueryWriteIntegrationTests.test_big_query_write_new_types
>  is flaky
> ---
>
> Key: BEAM-8446
> URL: https://issues.apache.org/jira/browse/BEAM-8446
> Project: Beam
>  Issue Type: New Feature
>  Components: test-failures
>Reporter: Boyuan Zhang
>Assignee: Pablo Estrada
>Priority: Major
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> test_big_query_write_new_types appears to be flaky in 
> beam_PostCommit_Python37 test suite.
> https://builds.apache.org/job/beam_PostCommit_Python37/733/
> https://builds.apache.org/job/beam_PostCommit_Python37/739/





[jira] [Work logged] (BEAM-8446) apache_beam.io.gcp.bigquery_write_it_test.BigQueryWriteIntegrationTests.test_big_query_write_new_types is flaky

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8446?focusedWorklogId=332994&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332994
 ]

ASF GitHub Bot logged work on BEAM-8446:


Author: ASF GitHub Bot
Created on: 24/Oct/19 01:21
Start Date: 24/Oct/19 01:21
Worklog Time Spent: 10m 
  Work Description: pabloem commented on issue #9855: [BEAM-8446] Retrying 
BQ query on timeouts
URL: https://github.com/apache/beam/pull/9855#issuecomment-545700463
 
 
   Again timeout without failures: 
https://builds.apache.org/job/beam_PostCommit_Python37_PR/46/
 



Issue Time Tracking
---

Worklog Id: (was: 332994)
Time Spent: 3h 10m  (was: 3h)

> apache_beam.io.gcp.bigquery_write_it_test.BigQueryWriteIntegrationTests.test_big_query_write_new_types
>  is flaky
> ---
>
> Key: BEAM-8446
> URL: https://issues.apache.org/jira/browse/BEAM-8446
> Project: Beam
>  Issue Type: New Feature
>  Components: test-failures
>Reporter: Boyuan Zhang
>Assignee: Pablo Estrada
>Priority: Major
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> test_big_query_write_new_types appears to be flaky in 
> beam_PostCommit_Python37 test suite.
> https://builds.apache.org/job/beam_PostCommit_Python37/733/
> https://builds.apache.org/job/beam_PostCommit_Python37/739/





[jira] [Work logged] (BEAM-8446) apache_beam.io.gcp.bigquery_write_it_test.BigQueryWriteIntegrationTests.test_big_query_write_new_types is flaky

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8446?focusedWorklogId=332993&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332993
 ]

ASF GitHub Bot logged work on BEAM-8446:


Author: ASF GitHub Bot
Created on: 24/Oct/19 01:20
Start Date: 24/Oct/19 01:20
Worklog Time Spent: 10m 
  Work Description: pabloem commented on issue #9855: [BEAM-8446] Retrying 
BQ query on timeouts
URL: https://github.com/apache/beam/pull/9855#issuecomment-545700383
 
 
   Run Python 3.7 PostCommit
 



Issue Time Tracking
---

Worklog Id: (was: 332993)
Time Spent: 3h  (was: 2h 50m)

> apache_beam.io.gcp.bigquery_write_it_test.BigQueryWriteIntegrationTests.test_big_query_write_new_types
>  is flaky
> ---
>
> Key: BEAM-8446
> URL: https://issues.apache.org/jira/browse/BEAM-8446
> Project: Beam
>  Issue Type: New Feature
>  Components: test-failures
>Reporter: Boyuan Zhang
>Assignee: Pablo Estrada
>Priority: Major
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> test_big_query_write_new_types appears to be flaky in 
> beam_PostCommit_Python37 test suite.
> https://builds.apache.org/job/beam_PostCommit_Python37/733/
> https://builds.apache.org/job/beam_PostCommit_Python37/739/





[jira] [Work logged] (BEAM-8146) SchemaCoder/RowCoder have no equals() function

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8146?focusedWorklogId=332987&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332987
 ]

ASF GitHub Bot logged work on BEAM-8146:


Author: ASF GitHub Bot
Created on: 24/Oct/19 01:09
Start Date: 24/Oct/19 01:09
Worklog Time Spent: 10m 
  Work Description: kennknowles commented on issue #9493: 
[BEAM-8146,BEAM-8204,BEAM-8205] Add equals and hashCode to SchemaCoder and 
RowCoder
URL: https://github.com/apache/beam/pull/9493#issuecomment-545698025
 
 
   I would say keep as-is. Being able to implement `getEncodedTypeDescriptor` 
adds value on its own.
 



Issue Time Tracking
---

Worklog Id: (was: 332987)
Time Spent: 3h  (was: 2h 50m)

> SchemaCoder/RowCoder have no equals() function
> --
>
> Key: BEAM-8146
> URL: https://issues.apache.org/jira/browse/BEAM-8146
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Affects Versions: 2.15.0
>Reporter: Brian Hulette
>Assignee: Brian Hulette
>Priority: Major
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> SchemaCoder has no equals function, so it can't be compared in tests, like 
> CloudComponentsTests$DefaultCoders, which is being re-enabled in 
> https://github.com/apache/beam/pull/9446





[jira] [Work logged] (BEAM-8146) SchemaCoder/RowCoder have no equals() function

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8146?focusedWorklogId=332989=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332989
 ]

ASF GitHub Bot logged work on BEAM-8146:


Author: ASF GitHub Bot
Created on: 24/Oct/19 01:09
Start Date: 24/Oct/19 01:09
Worklog Time Spent: 10m 
  Work Description: kennknowles commented on issue #9493: 
[BEAM-8146,BEAM-8204,BEAM-8205] Add equals and hashCode to SchemaCoder and 
RowCoder
URL: https://github.com/apache/beam/pull/9493#issuecomment-545698141
 
 
   Given how long you've been waiting on this, I'm inclined to merge. Last 
chance to comment, @reuvenlax?
 



Issue Time Tracking
---

Worklog Id: (was: 332989)
Time Spent: 3h 10m  (was: 3h)

> SchemaCoder/RowCoder have no equals() function
> --
>
> Key: BEAM-8146
> URL: https://issues.apache.org/jira/browse/BEAM-8146
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Affects Versions: 2.15.0
>Reporter: Brian Hulette
>Assignee: Brian Hulette
>Priority: Major
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> SchemaCoder has no equals function, so it can't be compared in tests, like 
> CloudComponentsTests$DefaultCoders, which is being re-enabled in 
> https://github.com/apache/beam/pull/9446





[jira] [Work logged] (BEAM-7981) ParDo function wrapper doesn't support Iterable output types

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7981?focusedWorklogId=332986=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332986
 ]

ASF GitHub Bot logged work on BEAM-7981:


Author: ASF GitHub Bot
Created on: 24/Oct/19 00:49
Start Date: 24/Oct/19 00:49
Worklog Time Spent: 10m 
  Work Description: udim commented on pull request #9708: [BEAM-7981] Fix 
double iterable stripping
URL: https://github.com/apache/beam/pull/9708
 
 
   
 



Issue Time Tracking
---

Worklog Id: (was: 332986)
Time Spent: 2h 10m  (was: 2h)

> ParDo function wrapper doesn't support Iterable output types
> 
>
> Key: BEAM-7981
> URL: https://issues.apache.org/jira/browse/BEAM-7981
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Udi Meiri
>Assignee: Udi Meiri
>Priority: Major
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> I believe the bug is in CallableWrapperDoFn.default_type_hints, which 
> converts Iterable[str] to str.
> This test will be included (commented out) in 
> https://github.com/apache/beam/pull/9283
> {code}
>   def test_typed_callable_iterable_output(self):
> @typehints.with_input_types(int)
> @typehints.with_output_types(typehints.Iterable[str])
> def do_fn(element):
>   return [[str(element)] * 2]
> result = [1, 2] | beam.ParDo(do_fn)
> self.assertEqual([['1', '1'], ['2', '2']], sorted(result))
> {code}
> Result:
> {code}
> ==
> ERROR: test_typed_callable_iterable_output 
> (apache_beam.typehints.typed_pipeline_test.MainInputTest)
> --
> Traceback (most recent call last):
>   File 
> "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/typehints/typed_pipeline_test.py",
>  line 104, in test_typed_callable_iterable_output
> result = [1, 2] | beam.ParDo(do_fn)
>   File 
> "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/transforms/ptransform.py",
>  line 519, in __ror__
> p.run().wait_until_finish()
>   File 
> "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/pipeline.py", 
> line 406, in run
> self._options).run(False)
>   File 
> "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/pipeline.py", 
> line 419, in run
> return self.runner.run_pipeline(self, self._options)
>   File 
> "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/direct/direct_runner.py",
>  line 129, in run_pipeline
> return runner.run_pipeline(pipeline, options)
>   File 
> "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/portability/fn_api_runner.py",
>  line 366, in run_pipeline
> default_environment=self._default_environment))
>   File 
> "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/portability/fn_api_runner.py",
>  line 373, in run_via_runner_api
> return self.run_stages(stage_context, stages)
>   File 
> "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/portability/fn_api_runner.py",
>  line 455, in run_stages
> stage_context.safe_coders)
>   File 
> "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/portability/fn_api_runner.py",
>  line 733, in _run_stage
> result, splits = bundle_manager.process_bundle(data_input, data_output)
>   File 
> "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/portability/fn_api_runner.py",
>  line 1663, in process_bundle
> part, expected_outputs), part_inputs):
>   File "/usr/lib/python3.7/concurrent/futures/_base.py", line 586, in 
> result_iterator
> yield fs.pop().result()
>   File "/usr/lib/python3.7/concurrent/futures/_base.py", line 432, in result
> return self.__get_result()
>   File "/usr/lib/python3.7/concurrent/futures/_base.py", line 384, in 
> __get_result
> raise self._exception
>   File "/usr/lib/python3.7/concurrent/futures/thread.py", line 57, in run
> result = self.fn(*self.args, **self.kwargs)
>   File 
> "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/portability/fn_api_runner.py",
>  line 1663, in 
> part, expected_outputs), part_inputs):
>   File 
> "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/portability/fn_api_runner.py",
>  line 1601, in process_bundle
> result_future = 

[jira] [Work logged] (BEAM-7886) Make row coder a standard coder and implement in python

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7886?focusedWorklogId=332985&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332985
 ]

ASF GitHub Bot logged work on BEAM-7886:


Author: ASF GitHub Bot
Created on: 24/Oct/19 00:46
Start Date: 24/Oct/19 00:46
Worklog Time Spent: 10m 
  Work Description: TheNeuralBit commented on issue #9188: [BEAM-7886] Make 
row coder a standard coder and implement in Python
URL: https://github.com/apache/beam/pull/9188#issuecomment-545693677
 
 
   Run Python PreCommit
 



Issue Time Tracking
---

Worklog Id: (was: 332985)
Time Spent: 14h 10m  (was: 14h)

> Make row coder a standard coder and implement in python
> ---
>
> Key: BEAM-7886
> URL: https://issues.apache.org/jira/browse/BEAM-7886
> Project: Beam
>  Issue Type: Improvement
>  Components: beam-model, sdk-java-core, sdk-py-core
>Reporter: Brian Hulette
>Assignee: Brian Hulette
>Priority: Major
>  Time Spent: 14h 10m
>  Remaining Estimate: 0h
>






[jira] [Work logged] (BEAM-8451) Interactive Beam example failing from stack overflow

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8451?focusedWorklogId=332983=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332983
 ]

ASF GitHub Bot logged work on BEAM-8451:


Author: ASF GitHub Bot
Created on: 24/Oct/19 00:35
Start Date: 24/Oct/19 00:35
Worklog Time Spent: 10m 
  Work Description: aaltay commented on issue #9865: [BEAM-8451] Fix 
interactive beam max recursion err
URL: https://github.com/apache/beam/pull/9865#issuecomment-545691833
 
 
   R: @KevinGG 
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 332983)
Time Spent: 0.5h  (was: 20m)

> Interactive Beam example failing from stack overflow
> 
>
> Key: BEAM-8451
> URL: https://issues.apache.org/jira/browse/BEAM-8451
> Project: Beam
>  Issue Type: Bug
>  Components: examples-python, runner-py-interactive
>Reporter: Igor Durovic
>Assignee: Igor Durovic
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
>  
> RecursionError: maximum recursion depth exceeded in __instancecheck__
> at 
> [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/interactive/pipeline_analyzer.py#L405]
>  
> This occurred after the execution of the last cell in 
> [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/interactive/examples/Interactive%20Beam%20Example.ipynb]
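The RecursionError above comes from Python's fixed interpreter recursion limit. The sketch below (unrelated to Beam's actual pipeline_analyzer code; all names are hypothetical) shows why a recursive traversal of a deeply nested structure overflows, and the iterative rewrite that avoids it:

```python
import sys

def count_depth_recursive(node):
    # Recursive traversal: each nesting level consumes a Python stack frame,
    # so deep enough nesting exceeds sys.getrecursionlimit().
    if node is None:
        return 0
    return 1 + count_depth_recursive(node["child"])

def count_depth_iterative(node):
    # Iterative rewrite: depth is bounded by heap, not the recursion limit.
    depth = 0
    while node is not None:
        depth += 1
        node = node["child"]
    return depth

# Build a structure nested more deeply than the current recursion limit.
deep = None
for _ in range(sys.getrecursionlimit() + 100):
    deep = {"child": deep}

try:
    count_depth_recursive(deep)
except RecursionError:
    print("recursive traversal overflowed")

print(count_depth_iterative(deep))
```

Raising the limit with sys.setrecursionlimit() is the other common workaround, at the cost of more stack usage.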





[jira] [Work logged] (BEAM-8451) Interactive Beam example failing from stack overflow

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8451?focusedWorklogId=332980&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332980
 ]

ASF GitHub Bot logged work on BEAM-8451:


Author: ASF GitHub Bot
Created on: 24/Oct/19 00:16
Start Date: 24/Oct/19 00:16
Worklog Time Spent: 10m 
  Work Description: chunyang commented on issue #9865: [BEAM-8451] Fix 
interactive beam max recursion err
URL: https://github.com/apache/beam/pull/9865#issuecomment-545688129
 
 
   Run PythonLint PreCommit
 



Issue Time Tracking
---

Worklog Id: (was: 332980)
Time Spent: 20m  (was: 10m)

> Interactive Beam example failing from stack overflow
> 
>
> Key: BEAM-8451
> URL: https://issues.apache.org/jira/browse/BEAM-8451
> Project: Beam
>  Issue Type: Bug
>  Components: examples-python, runner-py-interactive
>Reporter: Igor Durovic
>Assignee: Igor Durovic
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
>  
> RecursionError: maximum recursion depth exceeded in __instancecheck__
> at 
> [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/interactive/pipeline_analyzer.py#L405]
>  
> This occurred after the execution of the last cell in 
> [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/interactive/examples/Interactive%20Beam%20Example.ipynb]





[jira] [Work logged] (BEAM-8382) Add polling interval to KinesisIO.Read

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8382?focusedWorklogId=332974&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332974
 ]

ASF GitHub Bot logged work on BEAM-8382:


Author: ASF GitHub Bot
Created on: 24/Oct/19 00:07
Start Date: 24/Oct/19 00:07
Worklog Time Spent: 10m 
  Work Description: jfarr commented on issue #9765: [BEAM-8382] Add polling 
interval to KinesisIO.Read
URL: https://github.com/apache/beam/pull/9765#issuecomment-545686422
 
 
   @aromanenko-dev I'm still testing and gathering data but one of my 
colleagues had an idea I wanted to run by you. What if we add a polling rate 
policy object and have fixed and fluent versions with fluent as the default?
 



Issue Time Tracking
---

Worklog Id: (was: 332974)
Time Spent: 5h  (was: 4h 50m)

> Add polling interval to KinesisIO.Read
> --
>
> Key: BEAM-8382
> URL: https://issues.apache.org/jira/browse/BEAM-8382
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-kinesis
>Affects Versions: 2.13.0, 2.14.0, 2.15.0
>Reporter: Jonothan Farr
>Assignee: Jonothan Farr
>Priority: Major
>  Time Spent: 5h
>  Remaining Estimate: 0h
>
> With the current implementation we are observing Kinesis throttling due to 
> ReadProvisionedThroughputExceeded on the order of hundreds of times per 
> second, regardless of the actual Kinesis throughput. This is because the 
> ShardReadersPool readLoop() method is polling getRecords() as fast as 
> possible.
> From the KDS documentation:
> {quote}Each shard can support up to five read transactions per second.
> {quote}
> and
> {quote}For best results, sleep for at least 1 second (1,000 milliseconds) 
> between calls to getRecords to avoid exceeding the limit on getRecords 
> frequency.
> {quote}
> [https://docs.aws.amazon.com/streams/latest/dev/service-sizes-and-limits.html]
> [https://docs.aws.amazon.com/streams/latest/dev/developing-consumers-with-sdk.html]
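The polling-rate-policy idea floated in the review thread could look roughly like the following Python sketch. All names here (FixedDelayPolicy, AdaptiveDelayPolicy, read_loop) are hypothetical illustrations, not KinesisIO's actual Java API; the adaptive variant simply backs off while a shard is idle and resets when records arrive:

```python
import time

class FixedDelayPolicy:
    """Sleep a constant interval between polls."""
    def __init__(self, delay_s=1.0):
        self.delay_s = delay_s

    def next_delay(self, got_records):
        return self.delay_s

class AdaptiveDelayPolicy:
    """Double the delay while the shard is idle, reset when records arrive."""
    def __init__(self, base_s=1.0, max_s=10.0):
        self.base_s = base_s
        self.max_s = max_s
        self._current = base_s

    def next_delay(self, got_records):
        if got_records:
            self._current = self.base_s
        else:
            self._current = min(self._current * 2, self.max_s)
        return self._current

def read_loop(get_records, policy, polls):
    # Poll a shard `polls` times, sleeping per the policy between calls,
    # mirroring the >=1s spacing recommended for Kinesis getRecords.
    records = []
    for _ in range(polls):
        batch = get_records()
        records.extend(batch)
        time.sleep(policy.next_delay(bool(batch)))
    return records
```

With a base delay of one second this stays within the five-reads-per-second shard limit quoted above, while the cap bounds latency when traffic resumes.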





[jira] [Commented] (BEAM-8397) DataflowRunnerTest.test_remote_runner_display_data fails due to infinite recursion during pickling.

2019-10-23 Thread Valentyn Tymofieiev (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-8397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16958419#comment-16958419
 ] 

Valentyn Tymofieiev commented on BEAM-8397:
---

(Did not see an update from previous comment)

> DataflowRunnerTest.test_remote_runner_display_data fails due to infinite 
> recursion during pickling.
> ---
>
> Key: BEAM-8397
> URL: https://issues.apache.org/jira/browse/BEAM-8397
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-py-core
>Reporter: Valentyn Tymofieiev
>Assignee: Valentyn Tymofieiev
>Priority: Major
>
> `python ./setup.py test -s 
> apache_beam.runners.dataflow.dataflow_runner_test.DataflowRunnerTest.test_remote_runner_display_data`
>  passes.
> `tox -e py37-gcp` passes if Beam depends on dill==0.3.0, but fails if Beam 
> depends on dill==0.3.1.1.`python ./setup.py nosetests --tests 
> 'apache_beam/runners/dataflow/dataflow_runner_test.py:DataflowRunnerTest.test_remote_runner_display_data`
>  fails currently if run on master.
> The failure indicates infinite recursion during pickling:
> {noformat}
> test_remote_runner_display_data 
> (apache_beam.runners.dataflow.dataflow_runner_test.DataflowRunnerTest) ... 
> Fatal Python error: Cannot recover from stack overflow.
> Current thread 0x7f9d700ed740 (most recent call first):
>   File "/usr/lib/python3.7/pickle.py", line 479 in get
>   File "/usr/lib/python3.7/pickle.py", line 497 in save
>   File "/usr/lib/python3.7/pickle.py", line 786 in save_tuple
>   File "/usr/lib/python3.7/pickle.py", line 504 in save
>   File "/usr/lib/python3.7/pickle.py", line 638 in save_reduce
>   File 
> "/usr/local/google/home/valentyn/tmp/py37env/lib/python3.7/site-packages/dill/_dill.py",
>  line 1394 in save_function
>   File "/usr/lib/python3.7/pickle.py", line 504 in save
>   File "/usr/lib/python3.7/pickle.py", line 882 in _batch_setitems
>   File "/usr/lib/python3.7/pickle.py", line 856 in save_dict
>   File 
> "/usr/local/google/home/valentyn/tmp/py37env/lib/python3.7/site-packages/dill/_dill.py",
>  line 910 in save_module_dict
>   File 
> "/usr/local/google/home/valentyn/projects/beam/clean/beam/sdks/python/apache_beam/internal/pickler.py",
>  line 198 in new_save_module_dict
>   File "/usr/lib/python3.7/pickle.py", line 504 in save
>   File "/usr/lib/python3.7/pickle.py", line 786 in save_tuple
>   File "/usr/lib/python3.7/pickle.py", line 504 in save
>   File "/usr/lib/python3.7/pickle.py", line 638 in save_reduce
>   File 
> "/usr/local/google/home/valentyn/projects/beam/clean/beam/sdks/python/apache_beam/internal/pickler.py",
>  line 114 in wrapper
>   File "/usr/lib/python3.7/pickle.py", line 504 in save
>   File "/usr/lib/python3.7/pickle.py", line 771 in save_tuple
>   File "/usr/lib/python3.7/pickle.py", line 504 in save
>   File "/usr/lib/python3.7/pickle.py", line 638 in save_reduce
>   File 
> "/usr/local/google/home/valentyn/tmp/py37env/lib/python3.7/site-packages/dill/_dill.py",
>  line 1137 in save_cell
>   File "/usr/lib/python3.7/pickle.py", line 504 in save
>   File "/usr/lib/python3.7/pickle.py", line 771 in save_tuple
>   File "/usr/lib/python3.7/pickle.py", line 504 in save
>   File "/usr/lib/python3.7/pickle.py", line 786 in save_tuple
>   File "/usr/lib/python3.7/pickle.py", line 504 in save
>   File "/usr/lib/python3.7/pickle.py", line 638 in save_reduce
>   File 
> "/usr/local/google/home/valentyn/tmp/py37env/lib/python3.7/site-packages/dill/_dill.py",
>  line 1394 in save_function
>   File "/usr/lib/python3.7/pickle.py", line 504 in save
>   File "/usr/lib/python3.7/pickle.py", line 882 in _batch_setitems
>   File "/usr/lib/python3.7/pickle.py", line 856 in save_dict
>   File 
> "/usr/local/google/home/valentyn/tmp/py37env/lib/python3.7/site-packages/dill/_dill.py",
>  line 910 in save_module_dict
>   File 
> "/usr/local/google/home/valentyn/projects/beam/clean/beam/sdks/python/apache_beam/internal/pickler.py",
>  line 198 in new_save_module_dict
> ...
> {noformat}
> cc: [~yoshiki.obata]





[jira] [Commented] (BEAM-8397) DataflowRunnerTest.test_remote_runner_display_data fails due to infinite recursion during pickling.

2019-10-23 Thread Valentyn Tymofieiev (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-8397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16958417#comment-16958417
 ] 

Valentyn Tymofieiev commented on BEAM-8397:
---

[~udim] pointed out that the "asubcomponent" entry is associated with an instance 
of HasDisplayData, so instead of adding an "asubcomponent" entry we accumulate the 
associated display_data values as per [1], and thus populate the dofn_value 
key. So the scenario of the existing test makes sense.

However, when we move SpecialDoFn and SpecialParDo out of the 
test_remote_runner_display_data method, the display_data suddenly becomes:
{noformat}
[{'key': 'dofn_value', 'namespace': 
'apache_beam.runners.dataflow.dataflow_runner_test.SpecialDoFn', 'type': 
'INTEGER', 'value': 42}, 
{'key': 'fn', 'namespace': 'apache_beam.transforms.core.ParDo', 'shortValue': 
'SpecialDoFn', 'label': 'Transform Function', 'type': 'STRING', 'value': 
'apache_beam.runners.dataflow.dataflow_runner_test.SpecialDoFn'}]   

{noformat}
While the expected display_data is
{noformat}
[{'key': 'a_time', 'namespace': 
'apache_beam.runners.dataflow.dataflow_runner_test.SpecialParDo', 'type': 
'TIMESTAMP', 'value': 1571829781919}, 
{'key': 'a_class', 'shortValue': 'SpecialParDo', 'namespace': 
'apache_beam.runners.dataflow.dataflow_runner_test.SpecialParDo', 'type': 
'STRING', 'value': 
'apache_beam.runners.dataflow.dataflow_runner_test.SpecialParDo'}, 
{'key': 'dofn_value', 'namespace': 
'apache_beam.runners.dataflow.dataflow_runner_test.SpecialDoFn', 'type': 
'INTEGER', 'value': 42}]
{noformat}
It appears that the display_data method of SpecialParDo is no longer called; 
instead, display_data is populated from the display_data of the ParDo class 
(note the namespace is transforms.core.ParDo). The root cause of this behavior 
is unclear so far.

[1] 
[https://github.com/apache/beam/blob/8e574a177794aa04345cf8cd0c2ce1717ee40ec6/sdks/python/apache_beam/transforms/display.py#L100]

> DataflowRunnerTest.test_remote_runner_display_data fails due to infinite 
> recursion during pickling.
> ---
>
> Key: BEAM-8397
> URL: https://issues.apache.org/jira/browse/BEAM-8397
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-py-core
>Reporter: Valentyn Tymofieiev
>Assignee: Valentyn Tymofieiev
>Priority: Major
>
> `python ./setup.py test -s 
> apache_beam.runners.dataflow.dataflow_runner_test.DataflowRunnerTest.test_remote_runner_display_data`
>  passes.
> `tox -e py37-gcp` passes if Beam depends on dill==0.3.0, but fails if Beam 
> depends on dill==0.3.1.1.`python ./setup.py nosetests --tests 
> 'apache_beam/runners/dataflow/dataflow_runner_test.py:DataflowRunnerTest.test_remote_runner_display_data`
>  fails currently if run on master.
> The failure indicates infinite recursion during pickling (full stack trace quoted above).

[jira] [Updated] (BEAM-8466) Python typehints: pep 484 warn and strict modes

2019-10-23 Thread Udi Meiri (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udi Meiri updated BEAM-8466:

Status: Open  (was: Triage Needed)

> Python typehints: pep 484 warn and strict modes
> ---
>
> Key: BEAM-8466
> URL: https://issues.apache.org/jira/browse/BEAM-8466
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Udi Meiri
>Assignee: Udi Meiri
>Priority: Major
>
> Allow type checking to use PEP 484 type hints: in one mode, only warn when 
> there are errors; in another mode, raise exceptions on errors.
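A minimal sketch of the warn-vs-strict distinction, written as a plain Python decorator. This is illustrative only; it is not Beam's actual pipeline option or type-checking implementation, and it only handles simple class annotations on positional arguments:

```python
import warnings
from typing import get_type_hints

def check_hints(mode="warn"):
    """Check simple PEP 484 annotations at call time.

    mode="warn"   -> emit a warning on mismatch and continue
    mode="strict" -> raise TypeError on mismatch
    """
    def decorator(fn):
        hints = get_type_hints(fn)
        def wrapper(*args, **kwargs):
            # co_varnames lists positional parameters first, so zip pairs
            # each provided positional argument with its annotation.
            for name, value in zip(fn.__code__.co_varnames, args):
                expected = hints.get(name)
                if isinstance(expected, type) and not isinstance(value, expected):
                    msg = f"{fn.__name__}: {name}={value!r} is not {expected.__name__}"
                    if mode == "strict":
                        raise TypeError(msg)
                    warnings.warn(msg)
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@check_hints(mode="warn")
def double(x: int) -> int:
    return x * 2

@check_hints(mode="strict")
def triple(x: int) -> int:
    return x * 3

print(double(21))
```

In warn mode a mismatched call still executes after emitting a warning; in strict mode the same mismatch raises before the function body runs.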





[jira] [Commented] (BEAM-8397) DataflowRunnerTest.test_remote_runner_display_data fails due to infinite recursion during pickling.

2019-10-23 Thread Udi Meiri (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-8397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16958414#comment-16958414
 ] 

Udi Meiri commented on BEAM-8397:
-

1. The "asubcomponent" entry is replaced by self.fn.display_data(). See this 
code: 
https://github.com/apache/beam/blob/8e574a177794aa04345cf8cd0c2ce1717ee40ec6/sdks/python/apache_beam/transforms/display.py#L99-L103

2. Beam's pickler.py tries to handle nested classes, but can only do so if the 
class is nested within another class. In this case the class is nested inside a 
method.
This code shows that it only recurses into classes:
https://github.com/apache/beam/blob/a757475e2017bbc92871e9d3507843fc0e1a9611/sdks/python/apache_beam/internal/pickler.py#L75

What happens in this test case is still a mystery to me. No idea why it works 
in some cases vs not in others.

Also, if we move SpecialParDo to be module-level, its display_data() is never 
called (ParDo.display_data is called instead).

[~robertwb] do you remember the intention for this test? Do users successfully 
subclass ParDo and override display_data in their pipelines?
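The nested-class limitation described above can be demonstrated with the standard pickle module alone: a module-level class pickles by reference (module plus qualified name), while a class defined inside a function carries `<locals>` in its qualified name and cannot be resolved, which is why Beam's pickler needs special handling (and dill's by-value serialization) for nested definitions:

```python
import pickle

class TopLevel:
    """Importable by qualified name, so pickle serializes it by reference."""
    pass

def make_local_class():
    class Local:  # qualname is 'make_local_class.<locals>.Local'
        pass
    return Local

# Module-level class: round-trips fine.
data = pickle.dumps(TopLevel)
assert pickle.loads(data) is TopLevel

# Function-local class: pickle cannot resolve '<locals>' in the qualified
# name. The C implementation raises AttributeError, the pure-Python one
# PicklingError, so catch both.
Local = make_local_class()
try:
    pickle.dumps(Local)
    print("pickled local class")
except (pickle.PicklingError, AttributeError):
    print("cannot pickle a class nested inside a function")
```

The same lookup failure applies to a class nested inside a method, which matches the SpecialDoFn/SpecialParDo situation in this test.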

> DataflowRunnerTest.test_remote_runner_display_data fails due to infinite 
> recursion during pickling.
> ---
>
> Key: BEAM-8397
> URL: https://issues.apache.org/jira/browse/BEAM-8397
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-py-core
>Reporter: Valentyn Tymofieiev
>Assignee: Valentyn Tymofieiev
>Priority: Major
>
> `python ./setup.py test -s 
> apache_beam.runners.dataflow.dataflow_runner_test.DataflowRunnerTest.test_remote_runner_display_data`
>  passes.
> `tox -e py37-gcp` passes if Beam depends on dill==0.3.0, but fails if Beam 
> depends on dill==0.3.1.1.`python ./setup.py nosetests --tests 
> 'apache_beam/runners/dataflow/dataflow_runner_test.py:DataflowRunnerTest.test_remote_runner_display_data`
>  fails currently if run on master.
> The failure indicates infinite recursion during pickling (full stack trace quoted above).

[jira] [Work logged] (BEAM-8456) BigQuery to Beam SQL timestamp has the wrong default: truncation makes the most sense

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8456?focusedWorklogId=332965&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332965
 ]

ASF GitHub Bot logged work on BEAM-8456:


Author: ASF GitHub Bot
Created on: 23/Oct/19 23:21
Start Date: 23/Oct/19 23:21
Worklog Time Spent: 10m 
  Work Description: apilloud commented on pull request #9849: [BEAM-8456] 
Add pipeline option to have Data Catalog truncate sub-millisecond precision
URL: https://github.com/apache/beam/pull/9849#discussion_r338318184
 
 

 ##
 File path: 
sdks/java/extensions/sql/datacatalog/src/main/java/org/apache/beam/sdk/extensions/sql/meta/provider/datacatalog/DataCatalogTableProvider.java
 ##
 @@ -138,8 +143,41 @@ private Table loadTableFromDC(String tableName) {
     }
   }
 
-  @Override
-  public BeamSqlTable buildBeamSqlTable(Table table) {
-    return delegateProviders.get(table.getType()).buildBeamSqlTable(table);
+  private static DataCatalogBlockingStub createDataCatalogClient(
+      DataCatalogPipelineOptions options) {
+    return DataCatalogGrpc.newBlockingStub(
+            ManagedChannelBuilder.forTarget(options.getDataCatalogEndpoint()).build())
+        .withCallCredentials(
+            MoreCallCredentials.from(options.as(GcpOptions.class).getGcpCredential()));
+  }
+
+  private static Map<String, TableProvider> getSupportedProviders() {
+    return Stream.of(
+            new PubsubJsonTableProvider(), new BigQueryTableProvider(), new TextTableProvider())
+        .collect(toMap(TableProvider::getTableType, p -> p));
+  }
+
+  private Table toCalciteTable(String tableName, Entry entry) {
+    if (entry.getSchema().getColumnsCount() == 0) {
+      throw new UnsupportedOperationException(
+          "Entry doesn't have a schema. Please attach a schema to '"
+              + tableName
+              + "' in Data Catalog: "
+              + entry.toString());
+    }
+    Schema schema = SchemaUtils.fromDataCatalog(entry.getSchema());
+
+    Optional<Table.Builder> tableBuilder = tableFactory.tableBuilder(entry);
+    if (tableBuilder.isPresent()) {
+      return tableBuilder.get().schema(schema).name(tableName).build();
+    } else {
 
 Review comment:
   nit: unneeded `else`.
 



Issue Time Tracking
---

Worklog Id: (was: 332965)
Time Spent: 1h 50m  (was: 1h 40m)

> BigQuery to Beam SQL timestamp has the wrong default: truncation makes the 
> most sense
> -
>
> Key: BEAM-8456
> URL: https://issues.apache.org/jira/browse/BEAM-8456
> Project: Beam
>  Issue Type: Improvement
>  Components: dsl-sql
>Reporter: Kenneth Knowles
>Assignee: Kenneth Knowles
>Priority: Major
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Most of the time, a user reading a timestamp from BigQuery with 
> higher-than-millisecond precision timestamps may not even realize that the 
> data source created these high precision timestamps. They are probably 
> timestamps on log entries generated by a system with higher precision.
> If they are using it with Beam SQL, which only supports millisecond 
> precision, it makes sense to "just work" by default.





[jira] [Work logged] (BEAM-8456) BigQuery to Beam SQL timestamp has the wrong default: truncation makes the most sense

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8456?focusedWorklogId=332959&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332959
 ]

ASF GitHub Bot logged work on BEAM-8456:


Author: ASF GitHub Bot
Created on: 23/Oct/19 23:21
Start Date: 23/Oct/19 23:21
Worklog Time Spent: 10m 
  Work Description: apilloud commented on pull request #9849: [BEAM-8456] 
Add pipeline option to have Data Catalog truncate sub-millisecond precision
URL: https://github.com/apache/beam/pull/9849#discussion_r338310639
 
 

 ##
 File path: 
sdks/java/extensions/sql/datacatalog/src/main/java/org/apache/beam/sdk/extensions/sql/meta/provider/datacatalog/DataCatalogTableProvider.java
 ##
 @@ -138,8 +143,41 @@ private Table loadTableFromDC(String tableName) {
     }
   }
 
-  @Override
-  public BeamSqlTable buildBeamSqlTable(Table table) {
-    return delegateProviders.get(table.getType()).buildBeamSqlTable(table);
+  private static DataCatalogBlockingStub createDataCatalogClient(
+      DataCatalogPipelineOptions options) {
+    return DataCatalogGrpc.newBlockingStub(
+            ManagedChannelBuilder.forTarget(options.getDataCatalogEndpoint()).build())
+        .withCallCredentials(
+            MoreCallCredentials.from(options.as(GcpOptions.class).getGcpCredential()));
+  }
+
+  private static Map<String, TableProvider> getSupportedProviders() {
+    return Stream.of(
+            new PubsubJsonTableProvider(), new BigQueryTableProvider(), new TextTableProvider())
 
 Review comment:
   nit: This could be inlined or even turned into a constant.
 



Issue Time Tracking
---

Worklog Id: (was: 332959)
Time Spent: 1h 20m  (was: 1h 10m)

> BigQuery to Beam SQL timestamp has the wrong default: truncation makes the 
> most sense
> -
>
> Key: BEAM-8456
> URL: https://issues.apache.org/jira/browse/BEAM-8456
> Project: Beam
>  Issue Type: Improvement
>  Components: dsl-sql
>Reporter: Kenneth Knowles
>Assignee: Kenneth Knowles
>Priority: Major
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Most of the time, a user reading a timestamp from BigQuery with 
> higher-than-millisecond precision timestamps may not even realize that the 
> data source created these high precision timestamps. They are probably 
> timestamps on log entries generated by a system with higher precision.
> If they are using it with Beam SQL, which only supports millisecond 
> precision, it makes sense to "just work" by default.





[jira] [Work logged] (BEAM-8456) BigQuery to Beam SQL timestamp has the wrong default: truncation makes the most sense

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8456?focusedWorklogId=332961&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332961
 ]

ASF GitHub Bot logged work on BEAM-8456:


Author: ASF GitHub Bot
Created on: 23/Oct/19 23:21
Start Date: 23/Oct/19 23:21
Worklog Time Spent: 10m 
  Work Description: apilloud commented on pull request #9849: [BEAM-8456] 
Add pipeline option to have Data Catalog truncate sub-millisecond precision
URL: https://github.com/apache/beam/pull/9849#discussion_r338321189
 
 

 ##
 File path: 
sdks/java/extensions/sql/datacatalog/src/main/java/org/apache/beam/sdk/extensions/sql/meta/provider/datacatalog/DataCatalogTableProvider.java
 ##
 @@ -138,8 +143,41 @@ private Table loadTableFromDC(String tableName) {
     }
   }
 
-  @Override
-  public BeamSqlTable buildBeamSqlTable(Table table) {
-    return delegateProviders.get(table.getType()).buildBeamSqlTable(table);
+  private static DataCatalogBlockingStub createDataCatalogClient(
+      DataCatalogPipelineOptions options) {
+    return DataCatalogGrpc.newBlockingStub(
+            ManagedChannelBuilder.forTarget(options.getDataCatalogEndpoint()).build())
+        .withCallCredentials(
+            MoreCallCredentials.from(options.as(GcpOptions.class).getGcpCredential()));
+  }
+
+  private static Map<String, TableProvider> getSupportedProviders() {
+    return Stream.of(
+            new PubsubJsonTableProvider(), new BigQueryTableProvider(), new TextTableProvider())
+        .collect(toMap(TableProvider::getTableType, p -> p));
+  }
+
+  private Table toCalciteTable(String tableName, Entry entry) {
+    if (entry.getSchema().getColumnsCount() == 0) {
+      throw new UnsupportedOperationException(
+          "Entry doesn't have a schema. Please attach a schema to '"
+              + tableName
+              + "' in Data Catalog: "
+              + entry.toString());
+    }
+    Schema schema = SchemaUtils.fromDataCatalog(entry.getSchema());
+
+    Optional<Table.Builder> tableBuilder = tableFactory.tableBuilder(entry);
+    if (tableBuilder.isPresent()) {
+      return tableBuilder.get().schema(schema).name(tableName).build();
 
 Review comment:
   nit: This pattern gives you all the pain of `Optional` with none of the 
benefits.
 



Issue Time Tracking
---

Worklog Id: (was: 332961)
Time Spent: 1h 40m  (was: 1.5h)

> BigQuery to Beam SQL timestamp has the wrong default: truncation makes the 
> most sense
> -
>
> Key: BEAM-8456
> URL: https://issues.apache.org/jira/browse/BEAM-8456
> Project: Beam
>  Issue Type: Improvement
>  Components: dsl-sql
>Reporter: Kenneth Knowles
>Assignee: Kenneth Knowles
>Priority: Major
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Most of the time, a user reading a timestamp from BigQuery with 
> higher-than-millisecond precision timestamps may not even realize that the 
> data source created these high precision timestamps. They are probably 
> timestamps on log entries generated by a system with higher precision.
> If they are using it with Beam SQL, which only supports millisecond 
> precision, it makes sense to "just work" by default.





[jira] [Work logged] (BEAM-8456) BigQuery to Beam SQL timestamp has the wrong default: truncation makes the most sense

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8456?focusedWorklogId=332963&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332963
 ]

ASF GitHub Bot logged work on BEAM-8456:


Author: ASF GitHub Bot
Created on: 23/Oct/19 23:21
Start Date: 23/Oct/19 23:21
Worklog Time Spent: 10m 
  Work Description: apilloud commented on pull request #9849: [BEAM-8456] 
Add pipeline option to have Data Catalog truncate sub-millisecond precision
URL: https://github.com/apache/beam/pull/9849#discussion_r338317396
 
 

 ##
 File path: 
sdks/java/extensions/sql/datacatalog/src/main/java/org/apache/beam/sdk/extensions/sql/meta/provider/datacatalog/TableFactory.java
 ##
 @@ -0,0 +1,26 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.beam.sdk.extensions.sql.meta.provider.datacatalog;
+
+import com.google.cloud.datacatalog.Entry;
+import java.util.Optional;
+import org.apache.beam.sdk.extensions.sql.meta.Table.Builder;
+
+interface TableFactory {
+  Optional<Builder> tableBuilder(Entry entry);
 
 Review comment:
   It is probably worth documenting this interface. Particularly that the 
`TableFactory` should check to see if the `Entry` belongs to it and return 
`Optional.empty()` if it doesn't.
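   The ownership-check contract described here can be sketched with a minimal Python analogue (the `Entry` stand-in and dict-based builder are hypothetical; the real interface is the Java one in this PR):
   
   ```python
   from typing import Optional
   
   
   class Entry:
       """Hypothetical stand-in for a Data Catalog Entry."""
   
       def __init__(self, linked_resource: str):
           self.linked_resource = linked_resource
   
   
   class BigQueryTableFactory:
       """Claims only entries whose linked resource is served by BigQuery."""
   
       BIGQUERY_API = "bigquery.googleapis.com"
   
       def table_builder(self, entry: Entry) -> Optional[dict]:
           # Return None (the analogue of Optional.empty()) when the
           # entry does not belong to this factory.
           if self.BIGQUERY_API not in entry.linked_resource:
               return None
           return {"type": "bigquery", "location": entry.linked_resource}
   ```
   
   A provider can then try each registered factory in turn and use the first non-empty result.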
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 332963)
Time Spent: 1h 40m  (was: 1.5h)

> BigQuery to Beam SQL timestamp has the wrong default: truncation makes the 
> most sense
> -
>
> Key: BEAM-8456
> URL: https://issues.apache.org/jira/browse/BEAM-8456
> Project: Beam
>  Issue Type: Improvement
>  Components: dsl-sql
>Reporter: Kenneth Knowles
>Assignee: Kenneth Knowles
>Priority: Major
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Most of the time, a user reading a timestamp from BigQuery with 
> higher-than-millisecond precision timestamps may not even realize that the 
> data source created these high precision timestamps. They are probably 
> timestamps on log entries generated by a system with higher precision.
> If they are using it with Beam SQL, which only supports millisecond 
> precision, it makes sense to "just work" by default.
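For a concrete sense of what truncation means here, a minimal sketch (a hypothetical helper; Beam's actual conversion lives in its BigQuery/SQL layers):

```python
def truncate_to_millis(epoch_micros: int) -> int:
    """Drop sub-millisecond precision from a microsecond epoch timestamp."""
    return (epoch_micros // 1000) * 1000
```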



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-8456) BigQuery to Beam SQL timestamp has the wrong default: truncation makes the most sense

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8456?focusedWorklogId=332960&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332960
 ]

ASF GitHub Bot logged work on BEAM-8456:


Author: ASF GitHub Bot
Created on: 23/Oct/19 23:21
Start Date: 23/Oct/19 23:21
Worklog Time Spent: 10m 
  Work Description: apilloud commented on pull request #9849: [BEAM-8456] 
Add pipeline option to have Data Catalog truncate sub-millisecond precision
URL: https://github.com/apache/beam/pull/9849#discussion_r338307365
 
 

 ##
 File path: 
sdks/java/extensions/sql/datacatalog/src/main/java/org/apache/beam/sdk/extensions/sql/meta/provider/datacatalog/BigQueryTableFactory.java
 ##
 @@ -20,23 +20,39 @@
 import com.alibaba.fastjson.JSONObject;
 import com.google.cloud.datacatalog.Entry;
 import java.net.URI;
+import java.util.Optional;
 import java.util.regex.Matcher;
 import java.util.regex.Pattern;
 import org.apache.beam.sdk.extensions.sql.meta.Table;
+import org.apache.beam.sdk.extensions.sql.meta.Table.Builder;
+import 
org.apache.beam.vendor.guava.v26_0_jre.com.google.common.collect.ImmutableMap;
 
-/** Utils to extract BQ-specific entry information. */
-class BigQueryUtils {
+/** {@link TableFactory} that understands Data Catalog BigQuery entries. */
+class BigQueryTableFactory implements TableFactory {
+  private static final String BIGQUERY_API = "bigquery.googleapis.com";
 
   private static final Pattern BQ_PATH_PATTERN =
   Pattern.compile(
   
"/projects/(?[^/]+)/datasets/(?[^/]+)/tables/(?[^/]+)");
 
-  static Table.Builder tableBuilder(Entry entry) {
-return Table.builder()
-.location(getLocation(entry))
-.properties(new JSONObject())
-.type("bigquery")
-.comment("");
+  private final boolean truncateTimestamps;
+
+  public BigQueryTableFactory(boolean truncateTimestamps) {
+this.truncateTimestamps = truncateTimestamps;
+  }
+
+  @Override
+  public Optional<Builder> tableBuilder(Entry entry) {
+if 
(URI.create(entry.getLinkedResource()).getAuthority().toLowerCase().equals(BIGQUERY_API))
 {
+  return Optional.of(
+  Table.builder()
+  .location(getLocation(entry))
+  .properties(new JSONObject(ImmutableMap.of("truncateTimestamps", 
truncateTimestamps)))
+  .type("bigquery")
+  .comment(""));
+} else {
 
 Review comment:
   nit: Unneeded else. This might be clearer if the `if` bailed out.
   
   For example:
   ```
   if (!URI.equals()) {
   return Optional.empty();
   }
   // Actual work
   ```
 



Issue Time Tracking
---

Worklog Id: (was: 332960)
Time Spent: 1.5h  (was: 1h 20m)

> BigQuery to Beam SQL timestamp has the wrong default: truncation makes the 
> most sense
> -
>
> Key: BEAM-8456
> URL: https://issues.apache.org/jira/browse/BEAM-8456
> Project: Beam
>  Issue Type: Improvement
>  Components: dsl-sql
>Reporter: Kenneth Knowles
>Assignee: Kenneth Knowles
>Priority: Major
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Most of the time, a user reading a timestamp from BigQuery with 
> higher-than-millisecond precision timestamps may not even realize that the 
> data source created these high precision timestamps. They are probably 
> timestamps on log entries generated by a system with higher precision.
> If they are using it with Beam SQL, which only supports millisecond 
> precision, it makes sense to "just work" by default.





[jira] [Work logged] (BEAM-8456) BigQuery to Beam SQL timestamp has the wrong default: truncation makes the most sense

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8456?focusedWorklogId=332958&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332958
 ]

ASF GitHub Bot logged work on BEAM-8456:


Author: ASF GitHub Bot
Created on: 23/Oct/19 23:21
Start Date: 23/Oct/19 23:21
Worklog Time Spent: 10m 
  Work Description: apilloud commented on pull request #9849: [BEAM-8456] 
Add pipeline option to have Data Catalog truncate sub-millisecond precision
URL: https://github.com/apache/beam/pull/9849#discussion_r338317511
 
 

 ##
 File path: 
sdks/java/extensions/sql/datacatalog/src/main/java/org/apache/beam/sdk/extensions/sql/meta/provider/datacatalog/PubsubTableFactory.java
 ##
 @@ -20,22 +20,32 @@
 import com.alibaba.fastjson.JSONObject;
 import com.google.cloud.datacatalog.Entry;
 import java.net.URI;
+import java.util.Optional;
 import java.util.regex.Matcher;
 import java.util.regex.Pattern;
 import org.apache.beam.sdk.extensions.sql.meta.Table;
+import org.apache.beam.sdk.extensions.sql.meta.Table.Builder;
 
-/** Utils to extract Pubsub-specific entry information. */
-class PubsubUtils {
+/** {@link TableFactory} that understands Data Catalog Pubsub entries. */
+class PubsubTableFactory implements TableFactory {
+
+  private static final String PUBSUB_API = "pubsub.googleapis.com";
 
   private static final Pattern PS_PATH_PATTERN =
   Pattern.compile("/projects/(?[^/]+)/topics/(?[^/]+)");
 
-  static Table.Builder tableBuilder(Entry entry) {
-return Table.builder()
-.location(getLocation(entry))
-.properties(new JSONObject())
-.type("pubsub")
-.comment("");
+  @Override
+  public Optional<Builder> tableBuilder(Entry entry) {
+if 
(URI.create(entry.getLinkedResource()).getAuthority().toLowerCase().equals(PUBSUB_API))
 {
+  return Optional.of(
+  Table.builder()
+  .location(getLocation(entry))
+  .properties(new JSONObject())
+  .type("pubsub")
+  .comment(""));
+} else {
 
 Review comment:
   nit: Unneeded else. This might be clearer if the `if` bailed out.
 



Issue Time Tracking
---

Worklog Id: (was: 332958)
Time Spent: 1h 10m  (was: 1h)

> BigQuery to Beam SQL timestamp has the wrong default: truncation makes the 
> most sense
> -
>
> Key: BEAM-8456
> URL: https://issues.apache.org/jira/browse/BEAM-8456
> Project: Beam
>  Issue Type: Improvement
>  Components: dsl-sql
>Reporter: Kenneth Knowles
>Assignee: Kenneth Knowles
>Priority: Major
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Most of the time, a user reading a timestamp from BigQuery with 
> higher-than-millisecond precision timestamps may not even realize that the 
> data source created these high precision timestamps. They are probably 
> timestamps on log entries generated by a system with higher precision.
> If they are using it with Beam SQL, which only supports millisecond 
> precision, it makes sense to "just work" by default.





[jira] [Work logged] (BEAM-8456) BigQuery to Beam SQL timestamp has the wrong default: truncation makes the most sense

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8456?focusedWorklogId=332964&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332964
 ]

ASF GitHub Bot logged work on BEAM-8456:


Author: ASF GitHub Bot
Created on: 23/Oct/19 23:21
Start Date: 23/Oct/19 23:21
Worklog Time Spent: 10m 
  Work Description: apilloud commented on pull request #9849: [BEAM-8456] 
Add pipeline option to have Data Catalog truncate sub-millisecond precision
URL: https://github.com/apache/beam/pull/9849#discussion_r338308066
 
 

 ##
 File path: 
sdks/java/extensions/sql/datacatalog/src/main/java/org/apache/beam/sdk/extensions/sql/meta/provider/datacatalog/DataCatalogPipelineOptions.java
 ##
 @@ -32,4 +32,12 @@
   String getDataCatalogEndpoint();
 
   void setDataCatalogEndpoint(String dataCatalogEndpoint);
+
+  /** Whether to truncate timestamps in tables described by Data Catalog. */
+  @Description("Truncate sub-millisecond precision timestamps in tables 
described by Data Catalog")
+  @Validation.Required
+  @Default.Boolean(false)
+  Boolean getTruncateTimestamps();
 
 Review comment:
   Nit: `boolean` should work here and below. (Is this a limitation of 
PipelineOptions?)
 



Issue Time Tracking
---

Worklog Id: (was: 332964)
Time Spent: 1h 50m  (was: 1h 40m)

> BigQuery to Beam SQL timestamp has the wrong default: truncation makes the 
> most sense
> -
>
> Key: BEAM-8456
> URL: https://issues.apache.org/jira/browse/BEAM-8456
> Project: Beam
>  Issue Type: Improvement
>  Components: dsl-sql
>Reporter: Kenneth Knowles
>Assignee: Kenneth Knowles
>Priority: Major
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Most of the time, a user reading a timestamp from BigQuery with 
> higher-than-millisecond precision timestamps may not even realize that the 
> data source created these high precision timestamps. They are probably 
> timestamps on log entries generated by a system with higher precision.
> If they are using it with Beam SQL, which only supports millisecond 
> precision, it makes sense to "just work" by default.





[jira] [Work logged] (BEAM-8456) BigQuery to Beam SQL timestamp has the wrong default: truncation makes the most sense

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8456?focusedWorklogId=332962&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332962
 ]

ASF GitHub Bot logged work on BEAM-8456:


Author: ASF GitHub Bot
Created on: 23/Oct/19 23:21
Start Date: 23/Oct/19 23:21
Worklog Time Spent: 10m 
  Work Description: apilloud commented on pull request #9849: [BEAM-8456] 
Add pipeline option to have Data Catalog truncate sub-millisecond precision
URL: https://github.com/apache/beam/pull/9849#discussion_r338308594
 
 

 ##
 File path: 
sdks/java/extensions/sql/datacatalog/src/main/java/org/apache/beam/sdk/extensions/sql/meta/provider/datacatalog/DataCatalogPipelineOptions.java
 ##
 @@ -32,4 +32,12 @@
   String getDataCatalogEndpoint();
 
   void setDataCatalogEndpoint(String dataCatalogEndpoint);
+
+  /** Whether to truncate timestamps in tables described by Data Catalog. */
 
 Review comment:
   Why is this applied at the Data Catalog level and not the Beam SQL level? It 
seems this is something every table provider should support. (It would be easy 
to expand later, I'm fine leaving it for now.)
 



Issue Time Tracking
---

Worklog Id: (was: 332962)
Time Spent: 1h 40m  (was: 1.5h)

> BigQuery to Beam SQL timestamp has the wrong default: truncation makes the 
> most sense
> -
>
> Key: BEAM-8456
> URL: https://issues.apache.org/jira/browse/BEAM-8456
> Project: Beam
>  Issue Type: Improvement
>  Components: dsl-sql
>Reporter: Kenneth Knowles
>Assignee: Kenneth Knowles
>Priority: Major
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Most of the time, a user reading a timestamp from BigQuery with 
> higher-than-millisecond precision timestamps may not even realize that the 
> data source created these high precision timestamps. They are probably 
> timestamps on log entries generated by a system with higher precision.
> If they are using it with Beam SQL, which only supports millisecond 
> precision, it makes sense to "just work" by default.





[jira] [Updated] (BEAM-8368) [Python] libprotobuf-generated exception when importing apache_beam

2019-10-23 Thread Brian Hulette (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brian Hulette updated BEAM-8368:

Fix Version/s: (was: 2.17.0)
   2.18.0

> [Python] libprotobuf-generated exception when importing apache_beam
> ---
>
> Key: BEAM-8368
> URL: https://issues.apache.org/jira/browse/BEAM-8368
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Affects Versions: 2.15.0, 2.17.0
>Reporter: Ubaier Bhat
>Assignee: Brian Hulette
>Priority: Blocker
> Fix For: 2.18.0
>
> Attachments: error_log.txt
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Unable to import apache_beam after upgrading to macOS 10.15 (Catalina). 
> Cleared all the pipenvs but can't get it working again.
> {code}
> import apache_beam as beam
> /Users/***/.local/share/virtualenvs/beam-etl-ims6DitU/lib/python3.7/site-packages/apache_beam/__init__.py:84:
>  UserWarning: Some syntactic constructs of Python 3 are not yet fully 
> supported by Apache Beam.
>   'Some syntactic constructs of Python 3 are not yet fully supported by '
> [libprotobuf ERROR google/protobuf/descriptor_database.cc:58] File already 
> exists in database: 
> [libprotobuf FATAL google/protobuf/descriptor.cc:1370] CHECK failed: 
> GeneratedDatabase()->Add(encoded_file_descriptor, size): 
> libc++abi.dylib: terminating with uncaught exception of type 
> google::protobuf::FatalException: CHECK failed: 
> GeneratedDatabase()->Add(encoded_file_descriptor, size): 
> {code}





[jira] [Work logged] (BEAM-7886) Make row coder a standard coder and implement in python

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7886?focusedWorklogId=332945&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332945
 ]

ASF GitHub Bot logged work on BEAM-7886:


Author: ASF GitHub Bot
Created on: 23/Oct/19 22:36
Start Date: 23/Oct/19 22:36
Worklog Time Spent: 10m 
  Work Description: TheNeuralBit commented on issue #9188: [BEAM-7886] Make 
row coder a standard coder and implement in Python
URL: https://github.com/apache/beam/pull/9188#issuecomment-545664933
 
 
   Run Python PreCommit
 



Issue Time Tracking
---

Worklog Id: (was: 332945)
Time Spent: 14h  (was: 13h 50m)

> Make row coder a standard coder and implement in python
> ---
>
> Key: BEAM-7886
> URL: https://issues.apache.org/jira/browse/BEAM-7886
> Project: Beam
>  Issue Type: Improvement
>  Components: beam-model, sdk-java-core, sdk-py-core
>Reporter: Brian Hulette
>Assignee: Brian Hulette
>Priority: Major
>  Time Spent: 14h
>  Remaining Estimate: 0h
>






[jira] [Work logged] (BEAM-8451) Interactive Beam example failing from stack overflow

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8451?focusedWorklogId=332941&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332941
 ]

ASF GitHub Bot logged work on BEAM-8451:


Author: ASF GitHub Bot
Created on: 23/Oct/19 22:32
Start Date: 23/Oct/19 22:32
Worklog Time Spent: 10m 
  Work Description: chunyang commented on pull request #9865: [BEAM-8451] 
Fix interactive beam max recursion err
URL: https://github.com/apache/beam/pull/9865
 
 
   When a pipeline contains a PTransform which passes through its input 
PCollection during `expand()`, the `PipelineInfo` object may run into an 
infinite recursion error while constructing the derivation for said PCollection.
   
   This PR prevents the infinite recursion by not marking a transform as the 
producer of a PCollection if that PCollection is also an input to the transform.
   
   I've added a test for this, but without the fix, the test does not fail 100% 
of the time--it's dependent on the ordering of transforms in the Pipeline 
proto. Any advice on how to test this better is appreciated.
   
   
   
   Thank you for your contribution! Follow this checklist to help us 
incorporate your contribution quickly and easily:
   
- [x] [**Choose 
reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and 
mention them in a comment (`R: @username`).
   R: @aaltay 
   R: @robertwb 
   R: @charlesccychen  
- [x] Format the pull request title like `[BEAM-XXX] Fixes bug in 
ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA 
issue, if applicable. This will automatically link the pull request to the 
issue.
- [ ] If this contribution is large, please file an Apache [Individual 
Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
   
   See the [Contributor Guide](https://beam.apache.org/contribute) for more 
tips on [how to make review process 
smoother](https://beam.apache.org/contribute/#make-reviewers-job-easier).
   
   Post-Commit Tests Status (on master branch)
   

   
   Lang | SDK | Apex | Dataflow | Flink | Gearpump | Samza | Spark
   --- | --- | --- | --- | --- | --- | --- | ---
   Go | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Go/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Go/lastCompletedBuild/)
 | --- | --- | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Go_VR_Flink/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Go_VR_Flink/lastCompletedBuild/)
 | --- | --- | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Go_VR_Spark/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Go_VR_Spark/lastCompletedBuild/)
   Java | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java/lastCompletedBuild/)
 | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Apex/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Apex/lastCompletedBuild/)
 | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow/lastCompletedBuild/)
 | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink/lastCompletedBuild/)[![Build
 
Status](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Flink_Batch/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Flink_Batch/lastCompletedBuild/)[![Build
 
Status](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Flink_Streaming/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Flink_Streaming/lastCompletedBuild/)
 | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Gearpump/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Gearpump/lastCompletedBuild/)
 | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Samza/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Samza/lastCompletedBuild/)
 | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Spark/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Spark/lastCompletedBuild/)[![Build
 

[jira] [Work logged] (BEAM-8428) [SQL] BigQuery should support project push-down in DIRECT_READ mode

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8428?focusedWorklogId=332936&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332936
 ]

ASF GitHub Bot logged work on BEAM-8428:


Author: ASF GitHub Bot
Created on: 23/Oct/19 22:18
Start Date: 23/Oct/19 22:18
Worklog Time Spent: 10m 
  Work Description: 11moon11 commented on pull request #9864: [BEAM-8428] 
[SQL] [WIP] BigQuery should return row fields in the selected order
URL: https://github.com/apache/beam/pull/9864
 
 
   This problem occurs when `BeamIOPushDownRule` drops the Calc. 
   BigQueryIO by default projects fields in the order they are present in the 
schema, but the expected behavior is to project fields in the order they are 
selected in the SQL statement.
   Also, `supportsProjects` should only support project push-down in the 
`DIRECT_READ` method.
   
   Based on top of #9863.
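   The expected behavior can be sketched as follows (a hypothetical dict-based row representation; BigQueryIO operates on Beam rows and schemas, not dicts):
   
   ```python
   def project_in_select_order(row: dict, selected_fields: list) -> dict:
       """Keep only the selected fields, ordered as in the SELECT list
       rather than as in the table schema."""
       return {field: row[field] for field in selected_fields}
   ```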
   
   
   
   Thank you for your contribution! Follow this checklist to help us 
incorporate your contribution quickly and easily:
   
- [ ] [**Choose 
reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and 
mention them in a comment (`R: @username`).
- [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in 
ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA 
issue, if applicable. This will automatically link the pull request to the 
issue.
- [ ] If this contribution is large, please file an Apache [Individual 
Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
   
   See the [Contributor Guide](https://beam.apache.org/contribute) for more 
tips on [how to make review process 
smoother](https://beam.apache.org/contribute/#make-reviewers-job-easier).
   
   

   
   Lang | SDK | Apex | Dataflow | Flink | Gearpump | Samza | Spark
   --- | --- | --- | --- | --- | --- | --- | ---
   Go | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Go/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Go/lastCompletedBuild/)
 | --- | --- | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Go_VR_Flink/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Go_VR_Flink/lastCompletedBuild/)
 | --- | --- | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Go_VR_Spark/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Go_VR_Spark/lastCompletedBuild/)
   Java | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java/lastCompletedBuild/)
 | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Apex/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Apex/lastCompletedBuild/)
 | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow/lastCompletedBuild/)
 | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink/lastCompletedBuild/)[![Build
 
Status](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Flink_Batch/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Flink_Batch/lastCompletedBuild/)[![Build
 
Status](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Flink_Streaming/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Flink_Streaming/lastCompletedBuild/)
 | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Gearpump/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Gearpump/lastCompletedBuild/)
 | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Samza/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Samza/lastCompletedBuild/)
 | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Spark/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Spark/lastCompletedBuild/)[![Build
 
Status](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Spark_Batch/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Spark_Batch/lastCompletedBuild/)
   Python | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Python2/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Python2/lastCompletedBuild/)[![Build
 

[jira] [Commented] (BEAM-8368) [Python] libprotobuf-generated exception when importing apache_beam

2019-10-23 Thread Ahmet Altay (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-8368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16958282#comment-16958282
 ] 

Ahmet Altay commented on BEAM-8368:
---

Let's move to 2.18 and keep watching the pyarrow update. (For reference, the 
arrow thread is 
[https://lists.apache.org/thread.html/462464941d79e67bfef7d73fc900ea5fe0200e8a926253c6eb0285aa@%3Cdev.arrow.apache.org%3E])

 

If there will be Beam 2.17 RC2 maybe we can reconsider at that time.

> [Python] libprotobuf-generated exception when importing apache_beam
> ---
>
> Key: BEAM-8368
> URL: https://issues.apache.org/jira/browse/BEAM-8368
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Affects Versions: 2.15.0, 2.17.0
>Reporter: Ubaier Bhat
>Assignee: Brian Hulette
>Priority: Blocker
> Fix For: 2.17.0
>
> Attachments: error_log.txt
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Unable to import apache_beam after upgrading to macOS 10.15 (Catalina). 
> Cleared all the pipenvs but can't get it working again.
> {code}
> import apache_beam as beam
> /Users/***/.local/share/virtualenvs/beam-etl-ims6DitU/lib/python3.7/site-packages/apache_beam/__init__.py:84:
>  UserWarning: Some syntactic constructs of Python 3 are not yet fully 
> supported by Apache Beam.
>   'Some syntactic constructs of Python 3 are not yet fully supported by '
> [libprotobuf ERROR google/protobuf/descriptor_database.cc:58] File already 
> exists in database: 
> [libprotobuf FATAL google/protobuf/descriptor.cc:1370] CHECK failed: 
> GeneratedDatabase()->Add(encoded_file_descriptor, size): 
> libc++abi.dylib: terminating with uncaught exception of type 
> google::protobuf::FatalException: CHECK failed: 
> GeneratedDatabase()->Add(encoded_file_descriptor, size): 
> {code}





[jira] [Work logged] (BEAM-7926) Visualize PCollection with Interactive Beam

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7926?focusedWorklogId=332920&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332920
 ]

ASF GitHub Bot logged work on BEAM-7926:


Author: ASF GitHub Bot
Created on: 23/Oct/19 21:50
Start Date: 23/Oct/19 21:50
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on issue #9741: [BEAM-7926] Visualize 
PCollection
URL: https://github.com/apache/beam/pull/9741#issuecomment-545650837
 
 
   I've added two tox envs and corresponding Py36 and Py37 test suites. Removed 
[interactive] from the test-required packages. Added it to the py2:docs env.
   I'll add [interactive] for other specific envs defined in tox.ini if any of 
the gradle tasks need it.
   
   I've also added checks and warnings in 
`interactive_environment.InteractiveEnvironment` to check prerequisites for 
interactive beam (didn't put them in the global scope because global scoped 
warning level logging will fail gradle tasks in pre-commit):
   1. If Python is above 3.5.3;
   2. If [interactive] dependencies are fully available;
   3. If current runtime is within an interactive environment, i.e., within an 
ipython kernel.
   
   For the `time.sleep()` and `pcoll_visualization` NOOP strategy when 
dependencies are not fully installed, please let me know your preference, 
@aaltay. Personally I'm fine with either approach, and I'll make the change if 
needed.
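   The three prerequisite checks listed above might look roughly like this (an illustrative sketch with hypothetical dependency names, not the actual `InteractiveEnvironment` code):
   
   ```python
   import importlib.util
   import sys
   
   
   def interactive_beam_warnings():
       """Collect warning strings for unmet interactive-Beam prerequisites."""
       warnings = []
       # 1. Python version check.
       if sys.version_info < (3, 5, 3):
           warnings.append("Python >= 3.5.3 is recommended for interactive Beam.")
       # 2. Optional [interactive] dependencies (names are illustrative).
       for dep in ("IPython", "facets_overview"):
           if importlib.util.find_spec(dep) is None:
               warnings.append("Missing optional dependency: %s" % dep)
       # 3. Are we inside an IPython kernel? get_ipython only exists there.
       try:
           get_ipython  # noqa: F821 -- injected by IPython at runtime
       except NameError:
           warnings.append("Not running inside an IPython kernel.")
       return warnings
   ```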
 



Issue Time Tracking
---

Worklog Id: (was: 332920)
Time Spent: 13h 50m  (was: 13h 40m)

> Visualize PCollection with Interactive Beam
> ---
>
> Key: BEAM-7926
> URL: https://issues.apache.org/jira/browse/BEAM-7926
> Project: Beam
>  Issue Type: New Feature
>  Components: runner-py-interactive
>Reporter: Ning Kang
>Assignee: Ning Kang
>Priority: Major
>  Time Spent: 13h 50m
>  Remaining Estimate: 0h
>
> Support auto plotting / charting of materialized data of a given PCollection 
> with Interactive Beam.
> Say an Interactive Beam pipeline is defined as
> p = create_pipeline()
> pcoll = p | 'Transform' >> transform()
> The user can call a single function and get auto-magical charting of the data 
> as materialized pcoll.
> e.g., visualize(pcoll)





[jira] [Work logged] (BEAM-7926) Visualize PCollection with Interactive Beam

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7926?focusedWorklogId=332918&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332918
 ]

ASF GitHub Bot logged work on BEAM-7926:


Author: ASF GitHub Bot
Created on: 23/Oct/19 21:49
Start Date: 23/Oct/19 21:49
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on issue #9741: [BEAM-7926] Visualize 
PCollection
URL: https://github.com/apache/beam/pull/9741#issuecomment-545650837
 
 
   I've added two tox envs and corresponding Py36 and Py37 test suites. Removed 
[interactive] from the test-required packages. Added it to the py2:docs env.
   I'll add [interactive] for other specific envs defined in tox.ini if any of 
the gradle tasks need it.
   
   I've also added checks and warnings in 
`interactive_environment.InteractiveEnvironment` to check prerequisites for 
interactive beam (didn't put them in the global scope because global scoped 
warning level logging will fail gradle tasks in pre-commit):
   1. If Python is above 3.5.3;
   2. If [interactive] dependencies are fully available;
   3. If current runtime is within an interactive environment, i.e., within an 
ipython kernel.
   
   For the `time.sleep()` and `pcoll_visualization` NOOP strategy when 
dependencies are not fully installed, please let me know your preference, 
@aaltay. Personally I'm fine with either approach, and I'll make the change if 
needed.
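The three prerequisite checks listed above could be sketched as follows. This is a minimal illustration, not the actual `InteractiveEnvironment` code; the helper name and the dependency names checked are placeholders:

```python
import importlib.util
import sys
import warnings


def check_interactive_prerequisites():
    """Warn (instead of raising) when interactive prerequisites are missing.

    Returns True only when all three checks pass. Hypothetical sketch; the
    dependency names below are placeholders, not the real [interactive]
    extras list.
    """
    ok = True
    # 1. Python must be above 3.5.3.
    if sys.version_info <= (3, 5, 3):
        warnings.warn('Interactive Beam requires Python above 3.5.3.')
        ok = False
    # 2. The [interactive] optional dependencies must be importable.
    for dep in ('IPython', 'facets_overview'):
        if importlib.util.find_spec(dep) is None:
            warnings.warn('Missing [interactive] dependency: %s' % dep)
            ok = False
    # 3. The current runtime should be inside an IPython kernel.
    try:
        from IPython import get_ipython
        in_kernel = get_ipython() is not None
    except ImportError:
        in_kernel = False
    if not in_kernel:
        warnings.warn('Not running inside an IPython kernel.')
        ok = False
    return ok
```

Emitting warnings rather than module-level failures matches the rationale above: warning-level logging at global scope would fail gradle tasks in pre-commit.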
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 332918)
Time Spent: 13h 40m  (was: 13.5h)

> Visualize PCollection with Interactive Beam
> ---
>
> Key: BEAM-7926
> URL: https://issues.apache.org/jira/browse/BEAM-7926
> Project: Beam
>  Issue Type: New Feature
>  Components: runner-py-interactive
>Reporter: Ning Kang
>Assignee: Ning Kang
>Priority: Major
>  Time Spent: 13h 40m
>  Remaining Estimate: 0h
>
> Support auto plotting / charting of materialized data of a given PCollection 
> with Interactive Beam.
> Say an Interactive Beam pipeline defined as
> p = create_pipeline()
> pcoll = p | 'Transform' >> transform()
> The user can call a single function and get auto-magical charting of the data 
> as the materialized pcoll.
> e.g., visualize(pcoll)





[jira] [Resolved] (BEAM-7842) Add Python 2 deprecation warnings starting from 2.17.0 release.

2019-10-23 Thread Valentyn Tymofieiev (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Valentyn Tymofieiev resolved BEAM-7842.
---
Fix Version/s: (was: 2.17.0)
   2.16.0
   Resolution: Fixed

> Add Python 2 deprecation warnings starting from 2.17.0 release.
> ---
>
> Key: BEAM-7842
> URL: https://issues.apache.org/jira/browse/BEAM-7842
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-py-core
>Reporter: Valentyn Tymofieiev
>Assignee: Valentyn Tymofieiev
>Priority: Minor
> Fix For: 2.16.0
>
>






[jira] [Commented] (BEAM-7842) Add Python 2 deprecation warnings starting from 2.17.0 release.

2019-10-23 Thread Valentyn Tymofieiev (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-7842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16958269#comment-16958269
 ] 

Valentyn Tymofieiev commented on BEAM-7842:
---

We added the warnings starting from 2.16.0.

> Add Python 2 deprecation warnings starting from 2.17.0 release.
> ---
>
> Key: BEAM-7842
> URL: https://issues.apache.org/jira/browse/BEAM-7842
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-py-core
>Reporter: Valentyn Tymofieiev
>Assignee: Valentyn Tymofieiev
>Priority: Minor
> Fix For: 2.17.0
>
>






[jira] [Work logged] (BEAM-8341) basic bundling support for samza portable runner

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8341?focusedWorklogId=332913&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332913
 ]

ASF GitHub Bot logged work on BEAM-8341:


Author: ASF GitHub Bot
Created on: 23/Oct/19 21:29
Start Date: 23/Oct/19 21:29
Worklog Time Spent: 10m 
  Work Description: xinyuiscool commented on pull request #9777: 
[BEAM-8341]: basic bundling support for portable runner
URL: https://github.com/apache/beam/pull/9777#discussion_r338287932
 
 

 ##
 File path: 
runners/samza/src/main/java/org/apache/beam/runners/samza/translation/ParDoBoundMultiTranslator.java
 ##
 @@ -76,11 +79,36 @@
 doFnInvokerRegistrar = invokerReg.hasNext() ? 
Iterators.getOnlyElement(invokerReg) : null;
   }
 
+  /*
+   * Perform some common pipeline option validation. Bundling related logic 
for now
+   */
+  private static void validatePipelineOption(
 
 Review comment:
   we actually have a SamzaPipelineOptionsValidator. Can we move the logic 
there?
 



Issue Time Tracking
---

Worklog Id: (was: 332913)
Time Spent: 1h 50m  (was: 1h 40m)

> basic bundling support for samza portable runner
> 
>
> Key: BEAM-8341
> URL: https://issues.apache.org/jira/browse/BEAM-8341
> Project: Beam
>  Issue Type: Task
>  Components: runner-samza
>Reporter: Hai Lu
>Assignee: Hai Lu
>Priority: Major
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> bundling support for samza portable runner





[jira] [Work logged] (BEAM-8446) apache_beam.io.gcp.bigquery_write_it_test.BigQueryWriteIntegrationTests.test_big_query_write_new_types is flaky

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8446?focusedWorklogId=332905&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332905
 ]

ASF GitHub Bot logged work on BEAM-8446:


Author: ASF GitHub Bot
Created on: 23/Oct/19 21:22
Start Date: 23/Oct/19 21:22
Worklog Time Spent: 10m 
  Work Description: pabloem commented on issue #9855: [BEAM-8446] Retrying 
BQ query on timeouts
URL: https://github.com/apache/beam/pull/9855#issuecomment-545642130
 
 
   Run Python 3.7 PostCommit
 



Issue Time Tracking
---

Worklog Id: (was: 332905)
Time Spent: 2h 50m  (was: 2h 40m)

> apache_beam.io.gcp.bigquery_write_it_test.BigQueryWriteIntegrationTests.test_big_query_write_new_types
>  is flaky
> ---
>
> Key: BEAM-8446
> URL: https://issues.apache.org/jira/browse/BEAM-8446
> Project: Beam
>  Issue Type: New Feature
>  Components: test-failures
>Reporter: Boyuan Zhang
>Assignee: Pablo Estrada
>Priority: Major
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> test_big_query_write_new_types appears to be flaky in 
> beam_PostCommit_Python37 test suite.
> https://builds.apache.org/job/beam_PostCommit_Python37/733/
> https://builds.apache.org/job/beam_PostCommit_Python37/739/





[jira] [Work logged] (BEAM-7926) Visualize PCollection with Interactive Beam

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7926?focusedWorklogId=332904&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332904
 ]

ASF GitHub Bot logged work on BEAM-7926:


Author: ASF GitHub Bot
Created on: 23/Oct/19 21:21
Start Date: 23/Oct/19 21:21
Worklog Time Spent: 10m 
  Work Description: pabloem commented on issue #9771: [BEAM-7926] Update 
dependencies in Java Katas
URL: https://github.com/apache/beam/pull/9771#issuecomment-545641711
 
 
   Alright. LGTM. I'll just ping @henryken once more, and merge in a couple of 
days if he doesn't answer.
 



Issue Time Tracking
---

Worklog Id: (was: 332904)
Time Spent: 13.5h  (was: 13h 20m)

> Visualize PCollection with Interactive Beam
> ---
>
> Key: BEAM-7926
> URL: https://issues.apache.org/jira/browse/BEAM-7926
> Project: Beam
>  Issue Type: New Feature
>  Components: runner-py-interactive
>Reporter: Ning Kang
>Assignee: Ning Kang
>Priority: Major
>  Time Spent: 13.5h
>  Remaining Estimate: 0h
>
> Support auto plotting / charting of materialized data of a given PCollection 
> with Interactive Beam.
> Say an Interactive Beam pipeline defined as
> p = create_pipeline()
> pcoll = p | 'Transform' >> transform()
> The use can call a single function and get auto-magical charting of the data 
> as materialized pcoll.
> e.g., visualize(pcoll)





[jira] [Work logged] (BEAM-4775) JobService should support returning metrics

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-4775?focusedWorklogId=332896&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332896
 ]

ASF GitHub Bot logged work on BEAM-4775:


Author: ASF GitHub Bot
Created on: 23/Oct/19 21:12
Start Date: 23/Oct/19 21:12
Worklog Time Spent: 10m 
  Work Description: ajamato commented on pull request #9843: [BEAM-4775] 
Converting MonitoringInfos to MetricResults in PortableRunner
URL: https://github.com/apache/beam/pull/9843#discussion_r338280812
 
 

 ##
 File path: sdks/python/apache_beam/testing/load_tests/load_test.py
 ##
 @@ -52,32 +52,24 @@ def setUp(self):
 self.input_options = json.loads(self.pipeline.get_option('input_options'))
 self.project_id = self.pipeline.get_option('project')
 
-self.publish_to_big_query = 
self.pipeline.get_option('publish_to_big_query')
 self.metrics_dataset = self.pipeline.get_option('metrics_dataset')
 self.metrics_namespace = self.pipeline.get_option('metrics_table')
 
-if not self.are_metrics_collected():
-  logging.info('Metrics will not be collected')
-  self.metrics_monitor = None
-else:
-  self.metrics_monitor = MetricsReader(
-  project_name=self.project_id,
-  bq_table=self.metrics_namespace,
-  bq_dataset=self.metrics_dataset,
-  )
+self.metrics_monitor = MetricsReader(
 
 Review comment:
   I am not familiar with MetricsReader
 



Issue Time Tracking
---

Worklog Id: (was: 332896)
Time Spent: 51h 20m  (was: 51h 10m)

> JobService should support returning metrics
> ---
>
> Key: BEAM-4775
> URL: https://issues.apache.org/jira/browse/BEAM-4775
> Project: Beam
>  Issue Type: Bug
>  Components: beam-model
>Reporter: Eugene Kirpichov
>Assignee: Kamil Wasilewski
>Priority: Major
>  Time Spent: 51h 20m
>  Remaining Estimate: 0h
>
> Design doc: [https://s.apache.org/get-metrics-api].
> Further discussion is ongoing on [this 
> doc|https://docs.google.com/document/d/1m83TsFvJbOlcLfXVXprQm1B7vUakhbLZMzuRrOHWnTg/edit?ts=5c826bb4#heading=h.faqan9rjc6dm].
> We want to report job metrics back to the portability harness from the runner 
> harness, for displaying to users.
> h1. Relevant PRs in flight:
> h2. Ready for Review:
>  * [#8022|https://github.com/apache/beam/pull/8022]: correct the Job RPC 
> protos from [#8018|https://github.com/apache/beam/pull/8018].
> h2. Iterating / Discussing:
>  * [#7971|https://github.com/apache/beam/pull/7971]: Flink portable metrics: 
> get ptransform from MonitoringInfo, not stage name
>  ** this is a simpler, Flink-specific PR that is basically duplicated inside 
> each of the following two, so may be worth trying to merge in first
>  * #[7915|https://github.com/apache/beam/pull/7915]: use MonitoringInfo data 
> model in Java SDK metrics
>  * [#7868|https://github.com/apache/beam/pull/7868]: MonitoringInfo URN tweaks
> h2. Merged
>  * [#8018|https://github.com/apache/beam/pull/8018]: add job metrics RPC 
> protos
>  * [#7867|https://github.com/apache/beam/pull/7867]: key MetricResult by a 
> MetricKey
>  * [#7938|https://github.com/apache/beam/pull/7938]: move MonitoringInfo 
> protos to model/pipeline module
>  * [#7883|https://github.com/apache/beam/pull/7883]: Add 
> MetricQueryResults.allMetrics() helper
>  * [#7866|https://github.com/apache/beam/pull/7866]: move function helpers 
> from fn-harness to sdks/java/core
>  * [#7890|https://github.com/apache/beam/pull/7890]: consolidate MetricResult 
> implementations
> h2. Closed
>  * [#7934|https://github.com/apache/beam/pull/7934]: job metrics RPC + SDK 
> support
>  * [#7876|https://github.com/apache/beam/pull/7876]: Clean up metric protos; 
> support integer distributions, gauges





[jira] [Work logged] (BEAM-4775) JobService should support returning metrics

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-4775?focusedWorklogId=332895&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332895
 ]

ASF GitHub Bot logged work on BEAM-4775:


Author: ASF GitHub Bot
Created on: 23/Oct/19 21:12
Start Date: 23/Oct/19 21:12
Worklog Time Spent: 10m 
  Work Description: ajamato commented on pull request #9843: [BEAM-4775] 
Converting MonitoringInfos to MetricResults in PortableRunner
URL: https://github.com/apache/beam/pull/9843#discussion_r33828
 
 

 ##
 File path: sdks/python/apache_beam/runners/portability/portable_runner.py
 ##
 @@ -361,16 +363,31 @@ def add_runner_options(parser):
   state_stream, cleanup_callbacks)
 
 
-class PortableMetrics(metrics.metric.MetricResults):
+class PortableMetrics(metric.MetricResults,
+  portable_metrics.ParseMonitoringInfoMixin):
 
 Review comment:
   Please remove the use of double inheritance. You can use the methods from 
portable_metrics.ParseMonitoringInfoMixin by making that file define the 
methods at module level instead of in a class, since it only defines static 
methods.
 



Issue Time Tracking
---

Worklog Id: (was: 332895)
Time Spent: 51h 10m  (was: 51h)

> JobService should support returning metrics
> ---
>
> Key: BEAM-4775
> URL: https://issues.apache.org/jira/browse/BEAM-4775
> Project: Beam
>  Issue Type: Bug
>  Components: beam-model
>Reporter: Eugene Kirpichov
>Assignee: Kamil Wasilewski
>Priority: Major
>  Time Spent: 51h 10m
>  Remaining Estimate: 0h
>
> Design doc: [https://s.apache.org/get-metrics-api].
> Further discussion is ongoing on [this 
> doc|https://docs.google.com/document/d/1m83TsFvJbOlcLfXVXprQm1B7vUakhbLZMzuRrOHWnTg/edit?ts=5c826bb4#heading=h.faqan9rjc6dm].
> We want to report job metrics back to the portability harness from the runner 
> harness, for displaying to users.
> h1. Relevant PRs in flight:
> h2. Ready for Review:
>  * [#8022|https://github.com/apache/beam/pull/8022]: correct the Job RPC 
> protos from [#8018|https://github.com/apache/beam/pull/8018].
> h2. Iterating / Discussing:
>  * [#7971|https://github.com/apache/beam/pull/7971]: Flink portable metrics: 
> get ptransform from MonitoringInfo, not stage name
>  ** this is a simpler, Flink-specific PR that is basically duplicated inside 
> each of the following two, so may be worth trying to merge in first
>  * #[7915|https://github.com/apache/beam/pull/7915]: use MonitoringInfo data 
> model in Java SDK metrics
>  * [#7868|https://github.com/apache/beam/pull/7868]: MonitoringInfo URN tweaks
> h2. Merged
>  * [#8018|https://github.com/apache/beam/pull/8018]: add job metrics RPC 
> protos
>  * [#7867|https://github.com/apache/beam/pull/7867]: key MetricResult by a 
> MetricKey
>  * [#7938|https://github.com/apache/beam/pull/7938]: move MonitoringInfo 
> protos to model/pipeline module
>  * [#7883|https://github.com/apache/beam/pull/7883]: Add 
> MetricQueryResults.allMetrics() helper
>  * [#7866|https://github.com/apache/beam/pull/7866]: move function helpers 
> from fn-harness to sdks/java/core
>  * [#7890|https://github.com/apache/beam/pull/7890]: consolidate MetricResult 
> implementations
> h2. Closed
>  * [#7934|https://github.com/apache/beam/pull/7934]: job metrics RPC + SDK 
> support
>  * [#7876|https://github.com/apache/beam/pull/7876]: Clean up metric protos; 
> support integer distributions, gauges





[jira] [Work logged] (BEAM-4775) JobService should support returning metrics

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-4775?focusedWorklogId=332894&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332894
 ]

ASF GitHub Bot logged work on BEAM-4775:


Author: ASF GitHub Bot
Created on: 23/Oct/19 21:12
Start Date: 23/Oct/19 21:12
Worklog Time Spent: 10m 
  Work Description: ajamato commented on pull request #9843: [BEAM-4775] 
Converting MonitoringInfos to MetricResults in PortableRunner
URL: https://github.com/apache/beam/pull/9843#discussion_r338278403
 
 

 ##
 File path: sdks/python/apache_beam/runners/portability/portable_metrics.py
 ##
 @@ -0,0 +1,74 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from __future__ import absolute_import
+
+from apache_beam.metrics import monitoring_infos
+from apache_beam.metrics.execution import MetricKey
+from apache_beam.metrics.metric import MetricName
+
+
+class ParseMonitoringInfoMixin(object):
+  @staticmethod
+  def from_monitoring_infos(monitoring_info_list, user_metrics_only=False):
+"""Groups MonitoringInfo objects into counters, distributions and gauges.
+
+Args:
+  monitoring_info_list: An iterable of MonitoringInfo objects.
+  user_metrics_only: If true, includes user metrics only.
 
 Review comment:
   Please add a Returns section to pydoc comment
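For reference, a stub showing the shape such a Returns section could take. The body here is a placeholder (the grouping logic is elided), and the wording of the Returns text is a suggestion, not the author's:

```python
def from_monitoring_infos(monitoring_info_list, user_metrics_only=False):
  """Groups MonitoringInfo objects into counters, distributions and gauges.

  Args:
    monitoring_info_list: An iterable of MonitoringInfo objects.
    user_metrics_only: If true, includes user metrics only.

  Returns:
    A tuple (counters, distributions, gauges) of three dicts, each mapping
    a MetricKey to the metric result value extracted from the corresponding
    MonitoringInfo.
  """
  return {}, {}, {}  # placeholder body; grouping logic elided in this sketch
```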
 



Issue Time Tracking
---

Worklog Id: (was: 332894)
Time Spent: 51h  (was: 50h 50m)

> JobService should support returning metrics
> ---
>
> Key: BEAM-4775
> URL: https://issues.apache.org/jira/browse/BEAM-4775
> Project: Beam
>  Issue Type: Bug
>  Components: beam-model
>Reporter: Eugene Kirpichov
>Assignee: Kamil Wasilewski
>Priority: Major
>  Time Spent: 51h
>  Remaining Estimate: 0h
>
> Design doc: [https://s.apache.org/get-metrics-api].
> Further discussion is ongoing on [this 
> doc|https://docs.google.com/document/d/1m83TsFvJbOlcLfXVXprQm1B7vUakhbLZMzuRrOHWnTg/edit?ts=5c826bb4#heading=h.faqan9rjc6dm].
> We want to report job metrics back to the portability harness from the runner 
> harness, for displaying to users.
> h1. Relevant PRs in flight:
> h2. Ready for Review:
>  * [#8022|https://github.com/apache/beam/pull/8022]: correct the Job RPC 
> protos from [#8018|https://github.com/apache/beam/pull/8018].
> h2. Iterating / Discussing:
>  * [#7971|https://github.com/apache/beam/pull/7971]: Flink portable metrics: 
> get ptransform from MonitoringInfo, not stage name
>  ** this is a simpler, Flink-specific PR that is basically duplicated inside 
> each of the following two, so may be worth trying to merge in first
>  * #[7915|https://github.com/apache/beam/pull/7915]: use MonitoringInfo data 
> model in Java SDK metrics
>  * [#7868|https://github.com/apache/beam/pull/7868]: MonitoringInfo URN tweaks
> h2. Merged
>  * [#8018|https://github.com/apache/beam/pull/8018]: add job metrics RPC 
> protos
>  * [#7867|https://github.com/apache/beam/pull/7867]: key MetricResult by a 
> MetricKey
>  * [#7938|https://github.com/apache/beam/pull/7938]: move MonitoringInfo 
> protos to model/pipeline module
>  * [#7883|https://github.com/apache/beam/pull/7883]: Add 
> MetricQueryResults.allMetrics() helper
>  * [#7866|https://github.com/apache/beam/pull/7866]: move function helpers 
> from fn-harness to sdks/java/core
>  * [#7890|https://github.com/apache/beam/pull/7890]: consolidate MetricResult 
> implementations
> h2. Closed
>  * [#7934|https://github.com/apache/beam/pull/7934]: job metrics RPC + SDK 
> support
>  * [#7876|https://github.com/apache/beam/pull/7876]: Clean up metric protos; 
> support integer distributions, gauges





[jira] [Work logged] (BEAM-4775) JobService should support returning metrics

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-4775?focusedWorklogId=332898&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332898
 ]

ASF GitHub Bot logged work on BEAM-4775:


Author: ASF GitHub Bot
Created on: 23/Oct/19 21:12
Start Date: 23/Oct/19 21:12
Worklog Time Spent: 10m 
  Work Description: ajamato commented on pull request #9843: [BEAM-4775] 
Converting MonitoringInfos to MetricResults in PortableRunner
URL: https://github.com/apache/beam/pull/9843#discussion_r338279475
 
 

 ##
 File path: sdks/python/apache_beam/runners/portability/portable_metrics.py
 ##
 @@ -0,0 +1,74 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from __future__ import absolute_import
+
+from apache_beam.metrics import monitoring_infos
+from apache_beam.metrics.execution import MetricKey
+from apache_beam.metrics.metric import MetricName
+
+
+class ParseMonitoringInfoMixin(object):
+  @staticmethod
+  def from_monitoring_infos(monitoring_info_list, user_metrics_only=False):
+"""Groups MonitoringInfo objects into counters, distributions and gauges.
+
+Args:
+  monitoring_info_list: An iterable of MonitoringInfo objects.
+  user_metrics_only: If true, includes user metrics only.
+"""
+counters = {}
+distributions = {}
+gauges = {}
+
+for mi in monitoring_info_list:
+  if (user_metrics_only and
+  not monitoring_infos.is_user_monitoring_info(mi)):
+continue
+
+  key = ParseMonitoringInfoMixin._create_metric_key(mi)
+  metric_result = (monitoring_infos.extract_metric_result_map_value(mi))
+
+  if monitoring_infos.is_counter(mi):
+counters[key] = metric_result
+  elif monitoring_infos.is_distribution(mi):
+distributions[key] = metric_result
+  elif monitoring_infos.is_gauge(mi):
+gauges[key] = metric_result
+
+return counters, distributions, gauges
+
+  @staticmethod
+  def _create_metric_key(monitoring_info):
+step_name = ParseMonitoringInfoMixin._get_step_name(monitoring_info)
+namespace, name = 
monitoring_infos.parse_namespace_and_name(monitoring_info)
+return MetricKey(step_name, MetricName(namespace, name))
+
+  @staticmethod
+  def _get_step_name(monitoring_info):
+keys_to_check = [monitoring_infos.PTRANSFORM_LABEL,
 
 Review comment:
   This doesn't seem correct. Why would you consider the step name to be under 
labels other than PTRANSFORM_LABEL?
   
   see the MonitoringInfo specs defined here
   
https://github.com/apache/beam/blob/d4afbabf38a3ab557625c9c091ed5f06ca6731ce/model/pipeline/src/main/proto/metrics.proto
   
   PTRANSFORM_LABEL is the only one used for this purpose
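In line with this comment, a sketch of a step-name lookup that consults only the PTRANSFORM label. The `labels` dict stands in for `monitoring_info.labels` (a proto map), and the label key's string value is assumed here for illustration:

```python
PTRANSFORM_LABEL = 'PTRANSFORM'  # assumed label key, for illustration only


def get_step_name(labels):
    """Return the step name from a MonitoringInfo's labels mapping.

    Only the PTRANSFORM label identifies the producing step; no other
    label is consulted. Returns None when the label is absent.
    """
    return labels.get(PTRANSFORM_LABEL)
```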
 



Issue Time Tracking
---

Worklog Id: (was: 332898)
Time Spent: 51.5h  (was: 51h 20m)

> JobService should support returning metrics
> ---
>
> Key: BEAM-4775
> URL: https://issues.apache.org/jira/browse/BEAM-4775
> Project: Beam
>  Issue Type: Bug
>  Components: beam-model
>Reporter: Eugene Kirpichov
>Assignee: Kamil Wasilewski
>Priority: Major
>  Time Spent: 51.5h
>  Remaining Estimate: 0h
>
> Design doc: [https://s.apache.org/get-metrics-api].
> Further discussion is ongoing on [this 
> doc|https://docs.google.com/document/d/1m83TsFvJbOlcLfXVXprQm1B7vUakhbLZMzuRrOHWnTg/edit?ts=5c826bb4#heading=h.faqan9rjc6dm].
> We want to report job metrics back to the portability harness from the runner 
> harness, for displaying to users.
> h1. Relevant PRs in flight:
> h2. Ready for Review:
>  * [#8022|https://github.com/apache/beam/pull/8022]: correct the Job RPC 
> protos from [#8018|https://github.com/apache/beam/pull/8018].
> h2. Iterating / Discussing:
>  * 

[jira] [Work logged] (BEAM-4775) JobService should support returning metrics

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-4775?focusedWorklogId=332899&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332899
 ]

ASF GitHub Bot logged work on BEAM-4775:


Author: ASF GitHub Bot
Created on: 23/Oct/19 21:12
Start Date: 23/Oct/19 21:12
Worklog Time Spent: 10m 
  Work Description: ajamato commented on pull request #9843: [BEAM-4775] 
Converting MonitoringInfos to MetricResults in PortableRunner
URL: https://github.com/apache/beam/pull/9843#discussion_r338280472
 
 

 ##
 File path: sdks/python/apache_beam/runners/portability/portable_metrics.py
 ##
 @@ -0,0 +1,74 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from __future__ import absolute_import
+
+from apache_beam.metrics import monitoring_infos
+from apache_beam.metrics.execution import MetricKey
+from apache_beam.metrics.metric import MetricName
+
+
+class ParseMonitoringInfoMixin(object):
 
 Review comment:
   Please remove this class and instead define the methods at module level, 
without a class. All the methods here are static, so there is no need for a 
class/object instance.
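One way to apply this suggestion, sketched with a simplified stand-in for MonitoringInfo (plain tuples instead of protos, and a reduced helper signature): the functions live at module level, and callers import the module rather than inheriting a mixin.

```python
# portable_metrics.py (sketch): plain module-level functions instead of an
# all-static class. Each "monitoring info" is simplified here to a
# (step, namespace, name, value) tuple for illustration.

def _create_metric_key(step, namespace, name):
    # The real helper builds a MetricKey from a MonitoringInfo proto;
    # a tuple keeps this sketch dependency-free.
    return (step, namespace, name)


def from_monitoring_infos(monitoring_info_list):
    """Group simplified monitoring infos into a {metric_key: value} dict."""
    counters = {}
    for step, namespace, name, value in monitoring_info_list:
        counters[_create_metric_key(step, namespace, name)] = value
    return counters


# Caller side: import the module, no inheritance needed, e.g.
#   from apache_beam.runners.portability import portable_metrics
#   counters = portable_metrics.from_monitoring_infos(infos)
```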
 



Issue Time Tracking
---

Worklog Id: (was: 332899)

> JobService should support returning metrics
> ---
>
> Key: BEAM-4775
> URL: https://issues.apache.org/jira/browse/BEAM-4775
> Project: Beam
>  Issue Type: Bug
>  Components: beam-model
>Reporter: Eugene Kirpichov
>Assignee: Kamil Wasilewski
>Priority: Major
>  Time Spent: 51.5h
>  Remaining Estimate: 0h
>
> Design doc: [https://s.apache.org/get-metrics-api].
> Further discussion is ongoing on [this 
> doc|https://docs.google.com/document/d/1m83TsFvJbOlcLfXVXprQm1B7vUakhbLZMzuRrOHWnTg/edit?ts=5c826bb4#heading=h.faqan9rjc6dm].
> We want to report job metrics back to the portability harness from the runner 
> harness, for displaying to users.
> h1. Relevant PRs in flight:
> h2. Ready for Review:
>  * [#8022|https://github.com/apache/beam/pull/8022]: correct the Job RPC 
> protos from [#8018|https://github.com/apache/beam/pull/8018].
> h2. Iterating / Discussing:
>  * [#7971|https://github.com/apache/beam/pull/7971]: Flink portable metrics: 
> get ptransform from MonitoringInfo, not stage name
>  ** this is a simpler, Flink-specific PR that is basically duplicated inside 
> each of the following two, so may be worth trying to merge in first
>  * #[7915|https://github.com/apache/beam/pull/7915]: use MonitoringInfo data 
> model in Java SDK metrics
>  * [#7868|https://github.com/apache/beam/pull/7868]: MonitoringInfo URN tweaks
> h2. Merged
>  * [#8018|https://github.com/apache/beam/pull/8018]: add job metrics RPC 
> protos
>  * [#7867|https://github.com/apache/beam/pull/7867]: key MetricResult by a 
> MetricKey
>  * [#7938|https://github.com/apache/beam/pull/7938]: move MonitoringInfo 
> protos to model/pipeline module
>  * [#7883|https://github.com/apache/beam/pull/7883]: Add 
> MetricQueryResults.allMetrics() helper
>  * [#7866|https://github.com/apache/beam/pull/7866]: move function helpers 
> from fn-harness to sdks/java/core
>  * [#7890|https://github.com/apache/beam/pull/7890]: consolidate MetricResult 
> implementations
> h2. Closed
>  * [#7934|https://github.com/apache/beam/pull/7934]: job metrics RPC + SDK 
> support
>  * [#7876|https://github.com/apache/beam/pull/7876]: Clean up metric protos; 
> support integer distributions, gauges





[jira] [Work logged] (BEAM-4775) JobService should support returning metrics

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-4775?focusedWorklogId=332897&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332897
 ]

ASF GitHub Bot logged work on BEAM-4775:


Author: ASF GitHub Bot
Created on: 23/Oct/19 21:12
Start Date: 23/Oct/19 21:12
Worklog Time Spent: 10m 
  Work Description: ajamato commented on pull request #9843: [BEAM-4775] 
Converting MonitoringInfos to MetricResults in PortableRunner
URL: https://github.com/apache/beam/pull/9843#discussion_r338278677
 
 

 ##
 File path: sdks/python/apache_beam/runners/portability/portable_metrics.py
 ##
 @@ -0,0 +1,74 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from __future__ import absolute_import
+
+from apache_beam.metrics import monitoring_infos
+from apache_beam.metrics.execution import MetricKey
+from apache_beam.metrics.metric import MetricName
+
+
+class ParseMonitoringInfoMixin(object):
+  @staticmethod
+  def from_monitoring_infos(monitoring_info_list, user_metrics_only=False):
+    """Groups MonitoringInfo objects into counters, distributions and gauges.
+
+    Args:
+      monitoring_info_list: An iterable of MonitoringInfo objects.
+      user_metrics_only: If true, includes user metrics only.
+    """
+    counters = {}
+    distributions = {}
+    gauges = {}
+
+    for mi in monitoring_info_list:
+      if (user_metrics_only and
+          not monitoring_infos.is_user_monitoring_info(mi)):
+        continue
+
+      key = ParseMonitoringInfoMixin._create_metric_key(mi)
+      metric_result = (monitoring_infos.extract_metric_result_map_value(mi))
+
+      if monitoring_infos.is_counter(mi):
+        counters[key] = metric_result
+      elif monitoring_infos.is_distribution(mi):
+        distributions[key] = metric_result
+      elif monitoring_infos.is_gauge(mi):
+        gauges[key] = metric_result
+
+    return counters, distributions, gauges
+
+  @staticmethod
+  def _create_metric_key(monitoring_info):
+    step_name = ParseMonitoringInfoMixin._get_step_name(monitoring_info)
 
 Review comment:
   _get_step_name seems like it should live on monitoring_infos.py
 
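The grouping behaviour in the `from_monitoring_infos` diff above can be sketched with stand-in objects, independent of the Beam internals. All names below are illustrative (the real helpers such as `monitoring_infos.is_counter` inspect proto URNs; here a plain `kind` field plays that role):

```python
from collections import namedtuple

# Stand-in for the MonitoringInfo proto used by the quoted diff.
MonitoringInfo = namedtuple("MonitoringInfo", ["key", "kind", "value", "is_user"])

def group_monitoring_infos(mi_list, user_metrics_only=False):
    """Bucket MonitoringInfos into counters, distributions and gauges."""
    counters, distributions, gauges = {}, {}, {}
    for mi in mi_list:
        if user_metrics_only and not mi.is_user:
            continue  # drop system metrics when only user metrics are wanted
        bucket = {"counter": counters,
                  "distribution": distributions,
                  "gauge": gauges}.get(mi.kind)
        if bucket is not None:
            bucket[mi.key] = mi.value
    return counters, distributions, gauges

infos = [
    MonitoringInfo("step1/elements", "counter", 42, True),
    MonitoringInfo("step1/latency", "distribution", (4, 100, 10, 40), True),
    MonitoringInfo("sys/mem", "gauge", 2048, False),
]
counters, distributions, gauges = group_monitoring_infos(
    infos, user_metrics_only=True)
# counters == {"step1/elements": 42}; the system gauge is filtered out.
```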

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 332897)
Time Spent: 51.5h  (was: 51h 20m)

> JobService should support returning metrics
> ---
>
> Key: BEAM-4775
> URL: https://issues.apache.org/jira/browse/BEAM-4775
> Project: Beam
>  Issue Type: Bug
>  Components: beam-model
>Reporter: Eugene Kirpichov
>Assignee: Kamil Wasilewski
>Priority: Major
>  Time Spent: 51.5h
>  Remaining Estimate: 0h
>
> Design doc: [https://s.apache.org/get-metrics-api].
> Further discussion is ongoing on [this 
> doc|https://docs.google.com/document/d/1m83TsFvJbOlcLfXVXprQm1B7vUakhbLZMzuRrOHWnTg/edit?ts=5c826bb4#heading=h.faqan9rjc6dm].
> We want to report job metrics back to the portability harness from the runner 
> harness, for displaying to users.
> h1. Relevant PRs in flight:
> h2. Ready for Review:
>  * [#8022|https://github.com/apache/beam/pull/8022]: correct the Job RPC 
> protos from [#8018|https://github.com/apache/beam/pull/8018].
> h2. Iterating / Discussing:
>  * [#7971|https://github.com/apache/beam/pull/7971]: Flink portable metrics: 
> get ptransform from MonitoringInfo, not stage name
>  ** this is a simpler, Flink-specific PR that is basically duplicated inside 
> each of the following two, so may be worth trying to merge in first
>  * #[7915|https://github.com/apache/beam/pull/7915]: use MonitoringInfo data 
> model in Java SDK metrics
>  * [#7868|https://github.com/apache/beam/pull/7868]: MonitoringInfo URN tweaks
> h2. Merged
>  * [#8018|https://github.com/apache/beam/pull/8018]: add job metrics RPC 
> 

[jira] [Assigned] (BEAM-8459) Samza runner fails UsesStrictTimerOrdering category tests

2019-10-23 Thread Xinyu Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinyu Liu reassigned BEAM-8459:
---

Assignee: Xinyu Liu

> Samza runner fails UsesStrictTimerOrdering category tests
> -
>
> Key: BEAM-8459
> URL: https://issues.apache.org/jira/browse/BEAM-8459
> Project: Beam
>  Issue Type: Bug
>  Components: runner-samza
>Affects Versions: 2.17.0
>Reporter: Jan Lukavský
>Assignee: Xinyu Liu
>Priority: Major
>
> BEAM-7520 introduced a new set of validatesRunner tests that verify that timers 
> are fired exactly in order of increasing timestamp. The Samza runner fails these 
> newly added tests (they are currently ignored).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-4775) JobService should support returning metrics

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-4775?focusedWorklogId=332884&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332884
 ]

ASF GitHub Bot logged work on BEAM-4775:


Author: ASF GitHub Bot
Created on: 23/Oct/19 21:01
Start Date: 23/Oct/19 21:01
Worklog Time Spent: 10m 
  Work Description: ajamato commented on pull request #9843: [BEAM-4775] 
Converting MonitoringInfos to MetricResults in PortableRunner
URL: https://github.com/apache/beam/pull/9843#discussion_r338276288
 
 

 ##
 File path: sdks/python/apache_beam/runners/portability/local_job_service.py
 ##
 @@ -107,6 +108,26 @@ def stop(self, timeout=1):
     if os.path.exists(self._staging_dir) and self._cleanup_staging_dir:
       shutil.rmtree(self._staging_dir, ignore_errors=True)
 
+  def GetJobMetrics(self, request, context=None):
+    if request.job_id not in self._jobs:
+      raise LookupError("Job {} does not exist".format(request.job_id))
+
+    result = self._jobs[request.job_id].result
+    monitoring_info_list = []
+    for mi in result._monitoring_infos_by_stage.values():
+      monitoring_info_list.extend(mi)
+
+    # Filter out system metrics
+    user_monitoring_info_list = [
+        x for x in monitoring_info_list
+        if monitoring_infos._is_user_monitoring_info(x) or
+        monitoring_infos._is_user_distribution_monitoring_info(x)
+    ]
+
+    return beam_job_api_pb2.GetJobMetricsResponse(
 
 Review comment:
   FYI, here is some background. I was not aware that MetricResult existed in 
proto form now. Creating a proto was suggested but not pursued at the time.
   https://s.apache.org/get-metrics-api
   
   When I last worked here, MetricResult did not have a proto format. The plan 
was to use MonitoringInfos as the language-agnostic format, and then 
MetricResult would be a language-specific format (Python, Java, Go, etc.). Each 
runner should provide a way to return MonitoringInfos, and each language would 
have a library to convert MonitoringInfos to MetricResult protos.
   
   It's hard for me to review; I don't think I am up to date on the 
current plan/usage of these protos.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
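The `GetJobMetrics` flow quoted in this thread — flatten per-stage MonitoringInfo lists, then keep only user metrics — can be sketched in miniature. This is an illustrative sketch, not the Beam API; the stage map, the string-encoded metrics, and the `is_user` predicate are all stand-ins:

```python
def collect_user_metrics(infos_by_stage, is_user):
    """Flatten per-stage metric lists and keep only user metrics."""
    flat = []
    for stage_infos in infos_by_stage.values():
        flat.extend(stage_infos)  # mirrors monitoring_info_list.extend(mi)
    # Mirrors the list comprehension that filters out system metrics.
    return [mi for mi in flat if is_user(mi)]

by_stage = {
    "stage1": ["user:counter:a", "system:start-msecs"],
    "stage2": ["user:distribution:b"],
}
user = collect_user_metrics(by_stage, lambda mi: mi.startswith("user:"))
# Only the two "user:" entries survive; "system:start-msecs" is dropped.
```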


Issue Time Tracking
---

Worklog Id: (was: 332884)
Time Spent: 50h 40m  (was: 50.5h)

> JobService should support returning metrics
> ---
>
> Key: BEAM-4775
> URL: https://issues.apache.org/jira/browse/BEAM-4775
> Project: Beam
>  Issue Type: Bug
>  Components: beam-model
>Reporter: Eugene Kirpichov
>Assignee: Kamil Wasilewski
>Priority: Major
>  Time Spent: 50h 40m
>  Remaining Estimate: 0h
>
> Design doc: [https://s.apache.org/get-metrics-api].
> Further discussion is ongoing on [this 
> doc|https://docs.google.com/document/d/1m83TsFvJbOlcLfXVXprQm1B7vUakhbLZMzuRrOHWnTg/edit?ts=5c826bb4#heading=h.faqan9rjc6dm].
> We want to report job metrics back to the portability harness from the runner 
> harness, for displaying to users.
> h1. Relevant PRs in flight:
> h2. Ready for Review:
>  * [#8022|https://github.com/apache/beam/pull/8022]: correct the Job RPC 
> protos from [#8018|https://github.com/apache/beam/pull/8018].
> h2. Iterating / Discussing:
>  * [#7971|https://github.com/apache/beam/pull/7971]: Flink portable metrics: 
> get ptransform from MonitoringInfo, not stage name
>  ** this is a simpler, Flink-specific PR that is basically duplicated inside 
> each of the following two, so may be worth trying to merge in first
>  * #[7915|https://github.com/apache/beam/pull/7915]: use MonitoringInfo data 
> model in Java SDK metrics
>  * [#7868|https://github.com/apache/beam/pull/7868]: MonitoringInfo URN tweaks
> h2. Merged
>  * [#8018|https://github.com/apache/beam/pull/8018]: add job metrics RPC 
> protos
>  * [#7867|https://github.com/apache/beam/pull/7867]: key MetricResult by a 
> MetricKey
>  * [#7938|https://github.com/apache/beam/pull/7938]: move MonitoringInfo 
> protos to model/pipeline module
>  * [#7883|https://github.com/apache/beam/pull/7883]: Add 
> MetricQueryResults.allMetrics() helper
>  * [#7866|https://github.com/apache/beam/pull/7866]: move function helpers 
> from fn-harness to sdks/java/core
>  * [#7890|https://github.com/apache/beam/pull/7890]: consolidate MetricResult 
> implementations
> h2. Closed
>  * [#7934|https://github.com/apache/beam/pull/7934]: job metrics RPC + SDK 
> support
>  * [#7876|https://github.com/apache/beam/pull/7876]: Clean up metric protos; 
> support integer 

[jira] [Work logged] (BEAM-4775) JobService should support returning metrics

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-4775?focusedWorklogId=332886&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332886
 ]

ASF GitHub Bot logged work on BEAM-4775:


Author: ASF GitHub Bot
Created on: 23/Oct/19 21:01
Start Date: 23/Oct/19 21:01
Worklog Time Spent: 10m 
  Work Description: ajamato commented on pull request #9843: [BEAM-4775] 
Converting MonitoringInfos to MetricResults in PortableRunner
URL: https://github.com/apache/beam/pull/9843#discussion_r338276288
 
 

 ##
 File path: sdks/python/apache_beam/runners/portability/local_job_service.py
 ##
 @@ -107,6 +108,26 @@ def stop(self, timeout=1):
     if os.path.exists(self._staging_dir) and self._cleanup_staging_dir:
       shutil.rmtree(self._staging_dir, ignore_errors=True)
 
+  def GetJobMetrics(self, request, context=None):
+    if request.job_id not in self._jobs:
+      raise LookupError("Job {} does not exist".format(request.job_id))
+
+    result = self._jobs[request.job_id].result
+    monitoring_info_list = []
+    for mi in result._monitoring_infos_by_stage.values():
+      monitoring_info_list.extend(mi)
+
+    # Filter out system metrics
+    user_monitoring_info_list = [
+        x for x in monitoring_info_list
+        if monitoring_infos._is_user_monitoring_info(x) or
+        monitoring_infos._is_user_distribution_monitoring_info(x)
+    ]
+
+    return beam_job_api_pb2.GetJobMetricsResponse(
 
 Review comment:
   FYI, here is some background. I was not aware that MetricResult existed in 
proto form now. Creating a proto was suggested but not pursued at the time.
   https://s.apache.org/get-metrics-api
   
   When I last worked here, MetricResult did not have a proto format. The plan 
was to use MonitoringInfos as the language-agnostic format, and then 
MetricResult would be a language-specific format (Python, Java, Go, etc.). Each 
runner should provide a way to return MonitoringInfos, and each language would 
have a library to convert MonitoringInfos to MetricResult protos.
   
   It seems like you might be using a different approach: using MetricResult 
protos as the language-agnostic solution.
   
   It's hard for me to review; I don't think I am up to date on the 
current plan/usage of these protos.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 332886)
Time Spent: 50h 50m  (was: 50h 40m)

> JobService should support returning metrics
> ---
>
> Key: BEAM-4775
> URL: https://issues.apache.org/jira/browse/BEAM-4775
> Project: Beam
>  Issue Type: Bug
>  Components: beam-model
>Reporter: Eugene Kirpichov
>Assignee: Kamil Wasilewski
>Priority: Major
>  Time Spent: 50h 50m
>  Remaining Estimate: 0h
>
> Design doc: [https://s.apache.org/get-metrics-api].
> Further discussion is ongoing on [this 
> doc|https://docs.google.com/document/d/1m83TsFvJbOlcLfXVXprQm1B7vUakhbLZMzuRrOHWnTg/edit?ts=5c826bb4#heading=h.faqan9rjc6dm].
> We want to report job metrics back to the portability harness from the runner 
> harness, for displaying to users.
> h1. Relevant PRs in flight:
> h2. Ready for Review:
>  * [#8022|https://github.com/apache/beam/pull/8022]: correct the Job RPC 
> protos from [#8018|https://github.com/apache/beam/pull/8018].
> h2. Iterating / Discussing:
>  * [#7971|https://github.com/apache/beam/pull/7971]: Flink portable metrics: 
> get ptransform from MonitoringInfo, not stage name
>  ** this is a simpler, Flink-specific PR that is basically duplicated inside 
> each of the following two, so may be worth trying to merge in first
>  * #[7915|https://github.com/apache/beam/pull/7915]: use MonitoringInfo data 
> model in Java SDK metrics
>  * [#7868|https://github.com/apache/beam/pull/7868]: MonitoringInfo URN tweaks
> h2. Merged
>  * [#8018|https://github.com/apache/beam/pull/8018]: add job metrics RPC 
> protos
>  * [#7867|https://github.com/apache/beam/pull/7867]: key MetricResult by a 
> MetricKey
>  * [#7938|https://github.com/apache/beam/pull/7938]: move MonitoringInfo 
> protos to model/pipeline module
>  * [#7883|https://github.com/apache/beam/pull/7883]: Add 
> MetricQueryResults.allMetrics() helper
>  * [#7866|https://github.com/apache/beam/pull/7866]: move function helpers 
> from fn-harness to sdks/java/core
>  * [#7890|https://github.com/apache/beam/pull/7890]: consolidate MetricResult 
> implementations
> h2. Closed
>  * [#7934|https://github.com/apache/beam/pull/7934]: job metrics 

[jira] [Work logged] (BEAM-4775) JobService should support returning metrics

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-4775?focusedWorklogId=332883&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332883
 ]

ASF GitHub Bot logged work on BEAM-4775:


Author: ASF GitHub Bot
Created on: 23/Oct/19 21:00
Start Date: 23/Oct/19 21:00
Worklog Time Spent: 10m 
  Work Description: ajamato commented on pull request #9843: [BEAM-4775] 
Converting MonitoringInfos to MetricResults in PortableRunner
URL: https://github.com/apache/beam/pull/9843#discussion_r338276288
 
 

 ##
 File path: sdks/python/apache_beam/runners/portability/local_job_service.py
 ##
 @@ -107,6 +108,26 @@ def stop(self, timeout=1):
     if os.path.exists(self._staging_dir) and self._cleanup_staging_dir:
       shutil.rmtree(self._staging_dir, ignore_errors=True)
 
+  def GetJobMetrics(self, request, context=None):
+    if request.job_id not in self._jobs:
+      raise LookupError("Job {} does not exist".format(request.job_id))
+
+    result = self._jobs[request.job_id].result
+    monitoring_info_list = []
+    for mi in result._monitoring_infos_by_stage.values():
+      monitoring_info_list.extend(mi)
+
+    # Filter out system metrics
+    user_monitoring_info_list = [
+        x for x in monitoring_info_list
+        if monitoring_infos._is_user_monitoring_info(x) or
+        monitoring_infos._is_user_distribution_monitoring_info(x)
+    ]
+
+    return beam_job_api_pb2.GetJobMetricsResponse(
 
 Review comment:
   FYI, here is some background. I was not aware that these existed in proto 
form now. Creating a proto was suggested but not pursued at the time.
   https://s.apache.org/get-metrics-api
   
   When I last worked here, MetricResult did not have a proto format. The plan 
was to use MonitoringInfos as the language-agnostic format, and then 
MetricResult would be a language-specific format (Python, Java, Go, etc.). Each 
runner should provide a way to return MonitoringInfos, and each language would 
have a library to convert MonitoringInfos to MetricResult protos.
   
   It's hard for me to review; I don't think I am up to date on the 
current plan/usage of these protos.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 332883)
Time Spent: 50.5h  (was: 50h 20m)

> JobService should support returning metrics
> ---
>
> Key: BEAM-4775
> URL: https://issues.apache.org/jira/browse/BEAM-4775
> Project: Beam
>  Issue Type: Bug
>  Components: beam-model
>Reporter: Eugene Kirpichov
>Assignee: Kamil Wasilewski
>Priority: Major
>  Time Spent: 50.5h
>  Remaining Estimate: 0h
>
> Design doc: [https://s.apache.org/get-metrics-api].
> Further discussion is ongoing on [this 
> doc|https://docs.google.com/document/d/1m83TsFvJbOlcLfXVXprQm1B7vUakhbLZMzuRrOHWnTg/edit?ts=5c826bb4#heading=h.faqan9rjc6dm].
> We want to report job metrics back to the portability harness from the runner 
> harness, for displaying to users.
> h1. Relevant PRs in flight:
> h2. Ready for Review:
>  * [#8022|https://github.com/apache/beam/pull/8022]: correct the Job RPC 
> protos from [#8018|https://github.com/apache/beam/pull/8018].
> h2. Iterating / Discussing:
>  * [#7971|https://github.com/apache/beam/pull/7971]: Flink portable metrics: 
> get ptransform from MonitoringInfo, not stage name
>  ** this is a simpler, Flink-specific PR that is basically duplicated inside 
> each of the following two, so may be worth trying to merge in first
>  * #[7915|https://github.com/apache/beam/pull/7915]: use MonitoringInfo data 
> model in Java SDK metrics
>  * [#7868|https://github.com/apache/beam/pull/7868]: MonitoringInfo URN tweaks
> h2. Merged
>  * [#8018|https://github.com/apache/beam/pull/8018]: add job metrics RPC 
> protos
>  * [#7867|https://github.com/apache/beam/pull/7867]: key MetricResult by a 
> MetricKey
>  * [#7938|https://github.com/apache/beam/pull/7938]: move MonitoringInfo 
> protos to model/pipeline module
>  * [#7883|https://github.com/apache/beam/pull/7883]: Add 
> MetricQueryResults.allMetrics() helper
>  * [#7866|https://github.com/apache/beam/pull/7866]: move function helpers 
> from fn-harness to sdks/java/core
>  * [#7890|https://github.com/apache/beam/pull/7890]: consolidate MetricResult 
> implementations
> h2. Closed
>  * [#7934|https://github.com/apache/beam/pull/7934]: job metrics RPC + SDK 
> support
>  * [#7876|https://github.com/apache/beam/pull/7876]: Clean up metric protos; 
> support integer distributions, 

[jira] [Work logged] (BEAM-8335) Add streaming support to Interactive Beam

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8335?focusedWorklogId=332879&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332879
 ]

ASF GitHub Bot logged work on BEAM-8335:


Author: ASF GitHub Bot
Created on: 23/Oct/19 20:52
Start Date: 23/Oct/19 20:52
Worklog Time Spent: 10m 
  Work Description: rohdesamuel commented on issue #9720: [BEAM-8335] Add 
initial modules for interactive streaming support
URL: https://github.com/apache/beam/pull/9720#issuecomment-545631086
 
 
   I discussed with Robert the necessity of the GRPC service. In that discussion, 
we agreed that the easiest way to support an interactive streaming job is through 
an HTTP service. Unfortunately, using HTTP transcoding with GRPC requires having 
a copy of certain protos in the repo to be used during proto compilation. Thus, 
I will make a separate HTTP server without using the GRPC service definition and 
remove all InteractiveService rpcs _except_ for the Events request.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 332879)
Time Spent: 11h  (was: 10h 50m)

> Add streaming support to Interactive Beam
> -
>
> Key: BEAM-8335
> URL: https://issues.apache.org/jira/browse/BEAM-8335
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-py-interactive
>Reporter: Sam Rohde
>Assignee: Sam Rohde
>Priority: Major
>  Time Spent: 11h
>  Remaining Estimate: 0h
>
> This issue tracks the work items to introduce streaming support to the 
> Interactive Beam experience. This will allow users to:
>  * Write and run a streaming job in IPython
>  * Automatically cache records from unbounded sources
>  * Add a replay experience that replays all cached records to simulate the 
> original pipeline execution
>  * Add controls to play/pause/stop/step individual elements from the cached 
> records
>  * Add ability to inspect/visualize unbounded PCollections



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-8468) Add predicate/filter push-down capability to IO APIs

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8468?focusedWorklogId=332869&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332869
 ]

ASF GitHub Bot logged work on BEAM-8468:


Author: ASF GitHub Bot
Created on: 23/Oct/19 20:46
Start Date: 23/Oct/19 20:46
Worklog Time Spent: 10m 
  Work Description: 11moon11 commented on pull request #9863: [BEAM-8468] 
Predicate push down for in memory table
URL: https://github.com/apache/beam/pull/9863
 
 
   - Create a class `TestTableFilter implements BeamSqlTableFilter`
   - Update the `buildIOReader(PBegin begin, BeamSqlTableFilter filters, 
List<String> fieldNames)` method
   - Add a `BeamSqlTableFilter constructFilter(List<RexNode> filter)` method
   - Update the push-down rule
   
   
   
   Thank you for your contribution! Follow this checklist to help us 
incorporate your contribution quickly and easily:
   
- [ ] [**Choose 
reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and 
mention them in a comment (`R: @username`).
- [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in 
ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA 
issue, if applicable. This will automatically link the pull request to the 
issue.
- [ ] If this contribution is large, please file an Apache [Individual 
Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
   
   See the [Contributor Guide](https://beam.apache.org/contribute) for more 
tips on [how to make review process 
smoother](https://beam.apache.org/contribute/#make-reviewers-job-easier).
   

[jira] [Work logged] (BEAM-8468) Add predicate/filter push-down capability to IO APIs

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8468?focusedWorklogId=332870&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332870
 ]

ASF GitHub Bot logged work on BEAM-8468:


Author: ASF GitHub Bot
Created on: 23/Oct/19 20:46
Start Date: 23/Oct/19 20:46
Worklog Time Spent: 10m 
  Work Description: 11moon11 commented on issue #9863: [BEAM-8468] 
Predicate push down for in memory table
URL: https://github.com/apache/beam/pull/9863#issuecomment-545628821
 
 
   R: @apilloud 
   cc: @amaliujia 
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 332870)
Time Spent: 20m  (was: 10m)

> Add predicate/filter push-down capability to IO APIs
> 
>
> Key: BEAM-8468
> URL: https://issues.apache.org/jira/browse/BEAM-8468
> Project: Beam
>  Issue Type: New Feature
>  Components: dsl-sql
>Reporter: Kirill Kozlov
>Assignee: Kirill Kozlov
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> * InMemoryTable should implement the following methods:
> {code:java}
> public PCollection<Row> buildIOReader(
>     PBegin begin, BeamSqlTableFilter filters, List<String> fieldNames);
> public BeamSqlTableFilter constructFilter(List<RexNode> filter);
> {code}
>  * Update the push-down rule to support predicate/filter push-down.
>  * Create a class
> {code:java}
> class TestTableFilter implements BeamSqlTableFilter{code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-8468) Add predicate/filter push-down capability to IO APIs

2019-10-23 Thread Kirill Kozlov (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kirill Kozlov updated BEAM-8468:

Description: 
* InMemoryTable should implement the following methods:

{code:java}
public PCollection<Row> buildIOReader(
    PBegin begin, BeamSqlTableFilter filters, List<String> fieldNames);

public BeamSqlTableFilter constructFilter(List<RexNode> filter);
{code}
 * Update the push-down rule to support predicate/filter push-down.
 * Create a class

{code:java}
class TestTableFilter implements BeamSqlTableFilter{code}
 

  was:
* InMemoryTable should implement the following methods:

 
{code:java}
public PCollection<Row> buildIOReader(
    PBegin begin, BeamSqlTableFilter filters, List<String> fieldNames);

public BeamSqlTableFilter constructFilter(List<RexNode> filter);
{code}
 * Update the push-down rule to support predicate/filter push-down.
 * Create a class

{code:java}
class TestTableFilter implements BeamSqlTableFilter{code}

 


> Add predicate/filter push-down capability to IO APIs
> 
>
> Key: BEAM-8468
> URL: https://issues.apache.org/jira/browse/BEAM-8468
> Project: Beam
>  Issue Type: New Feature
>  Components: dsl-sql
>Reporter: Kirill Kozlov
>Assignee: Kirill Kozlov
>Priority: Major
>
> * InMemoryTable should implement the following methods:
> {code:java}
> public PCollection<Row> buildIOReader(
>     PBegin begin, BeamSqlTableFilter filters, List<String> fieldNames);
> public BeamSqlTableFilter constructFilter(List<RexNode> filter);
> {code}
>  * Update the push-down rule to support predicate/filter push-down.
>  * Create a class
> {code:java}
> class TestTableFilter implements BeamSqlTableFilter{code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (BEAM-8468) Add predicate/filter push-down capability to IO APIs

2019-10-23 Thread Kirill Kozlov (Jira)
Kirill Kozlov created BEAM-8468:
---

 Summary: Add predicate/filter push-down capability to IO APIs
 Key: BEAM-8468
 URL: https://issues.apache.org/jira/browse/BEAM-8468
 Project: Beam
  Issue Type: New Feature
  Components: dsl-sql
Reporter: Kirill Kozlov
Assignee: Kirill Kozlov


* InMemoryTable should implement the following methods:

 
{code:java}
public PCollection<Row> buildIOReader(
    PBegin begin, BeamSqlTableFilter filters, List<String> fieldNames);

public BeamSqlTableFilter constructFilter(List<RexNode> filter);
{code}
 * Update the push-down rule to support predicate/filter push-down.
 * Create a class

{code:java}
class TestTableFilter implements BeamSqlTableFilter{code}
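Predicate push-down, as requested in this issue, means the table scan itself applies supported filters so fewer rows reach the downstream pipeline. A minimal, language-neutral sketch (the actual Beam work is Java; the class and method names below are illustrative, not the Beam API):

```python
class InMemoryTable:
    """Toy in-memory table that accepts pushed-down predicates."""

    def __init__(self, rows):
        self.rows = rows  # list of dict rows

    def build_io_reader(self, predicates=(), field_names=None):
        # Apply pushed-down predicates at scan time, before anything
        # downstream sees the rows...
        out = [r for r in self.rows if all(p(r) for p in predicates)]
        # ...and project only the requested fields, if any were given.
        if field_names:
            out = [{f: r[f] for f in field_names} for r in out]
        return out

table = InMemoryTable([
    {"id": 1, "price": 5}, {"id": 2, "price": 50}, {"id": 3, "price": 500},
])
rows = table.build_io_reader(
    predicates=[lambda r: r["price"] > 10], field_names=["id"])
# Only ids 2 and 3 survive the pushed-down filter, projected to "id".
```

The design point is that the scan returns already-filtered, already-projected rows, which is what lets a SQL planner drop the separate filter/projection operators it would otherwise emit.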

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-8457) Instrument Dataflow jobs that are launched from Notebooks

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8457?focusedWorklogId=332850&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332850
 ]

ASF GitHub Bot logged work on BEAM-8457:


Author: ASF GitHub Bot
Created on: 23/Oct/19 20:25
Start Date: 23/Oct/19 20:25
Worklog Time Spent: 10m 
  Work Description: pabloem commented on pull request #9854: [BEAM-8457] 
Label Dataflow jobs from Notebook
URL: https://github.com/apache/beam/pull/9854
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 332850)
Time Spent: 1.5h  (was: 1h 20m)

> Instrument Dataflow jobs that are launched from Notebooks
> -
>
> Key: BEAM-8457
> URL: https://issues.apache.org/jira/browse/BEAM-8457
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-py-interactive
>Reporter: Ning Kang
>Assignee: Ning Kang
>Priority: Major
> Fix For: 2.17.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Dataflow needs the capability to tell how many Dataflow jobs are launched 
> from the Notebook environment, i.e., the Interactive Runner.
>  # Change the pipeline.run() API to allow supplying a runner and an option 
> parameter so that a pipeline initially bundled w/ an interactive runner can 
> be directly run by other runners from a notebook.
>  # Implicitly add the necessary source information through user labels when 
> the user does p.run(runner=DataflowRunner()).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-8457) Instrument Dataflow jobs that are launched from Notebooks

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8457?focusedWorklogId=332851&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332851
 ]

ASF GitHub Bot logged work on BEAM-8457:


Author: ASF GitHub Bot
Created on: 23/Oct/19 20:25
Start Date: 23/Oct/19 20:25
Worklog Time Spent: 10m 
  Work Description: pabloem commented on issue #9854: [BEAM-8457] Label 
Dataflow jobs from Notebook
URL: https://github.com/apache/beam/pull/9854#issuecomment-545620626
 
 
   Thanks Ning
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 332851)
Time Spent: 1h 40m  (was: 1.5h)

> Instrument Dataflow jobs that are launched from Notebooks
> -
>
> Key: BEAM-8457
> URL: https://issues.apache.org/jira/browse/BEAM-8457
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-py-interactive
>Reporter: Ning Kang
>Assignee: Ning Kang
>Priority: Major
> Fix For: 2.17.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Dataflow needs the capability to tell how many Dataflow jobs are launched 
> from the Notebook environment, i.e., the Interactive Runner.
>  # Change the pipeline.run() API to allow supplying a runner and an options 
> parameter so that a pipeline initially bundled w/ an interactive runner can 
> be directly run by other runners from a notebook.
>  # Implicitly add the necessary source information through user labels when 
> the user does p.run(runner=DataflowRunner()).





[jira] [Commented] (BEAM-8397) DataflowRunnerTest.test_remote_runner_display_data fails due to infinite recursion during pickling.

2019-10-23 Thread Valentyn Tymofieiev (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-8397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16958213#comment-16958213
 ] 

Valentyn Tymofieiev commented on BEAM-8397:
---

Looking closer at the test_remote_runner_display_data test, I have a feeling 
that the expected values were chosen by whatever values made the test pass. For 
example, the test expects [1] entries "a_class", "a_time", but does not expect 
"asubcomponent".

[1] https://github.com/apache/beam/blob/3330069291d8168c56c77acfef84c2566af05ec6/sdks/python/apache_beam/runners/dataflow/dataflow_runner_test.py#L273

> DataflowRunnerTest.test_remote_runner_display_data fails due to infinite 
> recursion during pickling.
> ---
>
> Key: BEAM-8397
> URL: https://issues.apache.org/jira/browse/BEAM-8397
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-py-core
>Reporter: Valentyn Tymofieiev
>Assignee: Valentyn Tymofieiev
>Priority: Major
>
> `python ./setup.py test -s 
> apache_beam.runners.dataflow.dataflow_runner_test.DataflowRunnerTest.test_remote_runner_display_data`
>  passes.
> `tox -e py37-gcp` passes if Beam depends on dill==0.3.0, but fails if Beam 
> depends on dill==0.3.1.1. `python ./setup.py nosetests --tests 
> 'apache_beam/runners/dataflow/dataflow_runner_test.py:DataflowRunnerTest.test_remote_runner_display_data'`
>  fails currently if run on master.
> The failure indicates infinite recursion during pickling:
> {noformat}
> test_remote_runner_display_data 
> (apache_beam.runners.dataflow.dataflow_runner_test.DataflowRunnerTest) ... 
> Fatal Python error: Cannot recover from stack overflow.
> Current thread 0x7f9d700ed740 (most recent call first):
>   File "/usr/lib/python3.7/pickle.py", line 479 in get
>   File "/usr/lib/python3.7/pickle.py", line 497 in save
>   File "/usr/lib/python3.7/pickle.py", line 786 in save_tuple
>   File "/usr/lib/python3.7/pickle.py", line 504 in save
>   File "/usr/lib/python3.7/pickle.py", line 638 in save_reduce
>   File 
> "/usr/local/google/home/valentyn/tmp/py37env/lib/python3.7/site-packages/dill/_dill.py",
>  line 1394 in save_function
>   File "/usr/lib/python3.7/pickle.py", line 504 in save
>   File "/usr/lib/python3.7/pickle.py", line 882 in _batch_setitems
>   File "/usr/lib/python3.7/pickle.py", line 856 in save_dict
>   File 
> "/usr/local/google/home/valentyn/tmp/py37env/lib/python3.7/site-packages/dill/_dill.py",
>  line 910 in save_module_dict
>   File 
> "/usr/local/google/home/valentyn/projects/beam/clean/beam/sdks/python/apache_beam/internal/pickler.py",
>  line 198 in new_save_module_dict
>   File "/usr/lib/python3.7/pickle.py", line 504 in save
>   File "/usr/lib/python3.7/pickle.py", line 786 in save_tuple
>   File "/usr/lib/python3.7/pickle.py", line 504 in save
>   File "/usr/lib/python3.7/pickle.py", line 638 in save_reduce
>   File 
> "/usr/local/google/home/valentyn/projects/beam/clean/beam/sdks/python/apache_beam/internal/pickler.py",
>  line 114 in wrapper
>   File "/usr/lib/python3.7/pickle.py", line 504 in save
>   File "/usr/lib/python3.7/pickle.py", line 771 in save_tuple
>   File "/usr/lib/python3.7/pickle.py", line 504 in save
>   File "/usr/lib/python3.7/pickle.py", line 638 in save_reduce
>   File 
> "/usr/local/google/home/valentyn/tmp/py37env/lib/python3.7/site-packages/dill/_dill.py",
>  line 1137 in save_cell
>   File "/usr/lib/python3.7/pickle.py", line 504 in save
>   File "/usr/lib/python3.7/pickle.py", line 771 in save_tuple
>   File "/usr/lib/python3.7/pickle.py", line 504 in save
>   File "/usr/lib/python3.7/pickle.py", line 786 in save_tuple
>   File "/usr/lib/python3.7/pickle.py", line 504 in save
>   File "/usr/lib/python3.7/pickle.py", line 638 in save_reduce
>   File 
> "/usr/local/google/home/valentyn/tmp/py37env/lib/python3.7/site-packages/dill/_dill.py",
>  line 1394 in save_function
>   File "/usr/lib/python3.7/pickle.py", line 504 in save
>   File "/usr/lib/python3.7/pickle.py", line 882 in _batch_setitems
>   File "/usr/lib/python3.7/pickle.py", line 856 in save_dict
>   File 
> "/usr/local/google/home/valentyn/tmp/py37env/lib/python3.7/site-packages/dill/_dill.py",
>  line 910 in save_module_dict
>   File 
> "/usr/local/google/home/valentyn/projects/beam/clean/beam/sdks/python/apache_beam/internal/pickler.py",
>  line 198 in new_save_module_dict
> ...
> {noformat}
> cc: [~yoshiki.obata]
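The stack trace in the description can be reproduced in miniature without Beam or dill: pickling recurses once per level of the object graph, so any graph deeper than the recursion limit exhausts the stack. A minimal stdlib illustration (not the actual Beam failure, which involves dill's save_function/save_module_dict hooks):

```python
import pickle

# Build a tuple nested far deeper than the default recursion limit (~1000).
x = ()
for _ in range(5000):
    x = (x,)

try:
    pickle.dumps(x)  # save() -> save_tuple() -> save() ... once per level
    raised = False
except RecursionError:
    raised = True
```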





[jira] [Comment Edited] (BEAM-8397) DataflowRunnerTest.test_remote_runner_display_data fails due to infinite recursion during pickling.

2019-10-23 Thread Valentyn Tymofieiev (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-8397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16958213#comment-16958213
 ] 

Valentyn Tymofieiev edited comment on BEAM-8397 at 10/23/19 8:12 PM:
-

Looking closer at the test_remote_runner_display_data test, I have a feeling 
that the expected values were chosen by whatever values made the test pass. For 
example, the test expects [1] entries "a_class", "a_time", but does not expect 
"asubcomponent".

[1] 
https://github.com/apache/beam/blob/3330069291d8168c56c77acfef84c2566af05ec6/sdks/python/apache_beam/runners/dataflow/dataflow_runner_test.py#L273


was (Author: tvalentyn):
Looking closer at the test_remote_runner_display_data test, I have a feeling 
that the expected values were chosen by whatever values made the test pass. For 
example, the test expects  [1]entries "a_class", "a_time", but does not expect 
"asubcomponent"

[1]https://github.com/apache/beam/blob/3330069291d8168c56c77acfef84c2566af05ec6/sdks/python/apache_beam/runners/dataflow/dataflow_runner_test.py#L273

> DataflowRunnerTest.test_remote_runner_display_data fails due to infinite 
> recursion during pickling.
> ---
>
> Key: BEAM-8397
> URL: https://issues.apache.org/jira/browse/BEAM-8397
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-py-core
>Reporter: Valentyn Tymofieiev
>Assignee: Valentyn Tymofieiev
>Priority: Major
>
> `python ./setup.py test -s 
> apache_beam.runners.dataflow.dataflow_runner_test.DataflowRunnerTest.test_remote_runner_display_data`
>  passes.
> `tox -e py37-gcp` passes if Beam depends on dill==0.3.0, but fails if Beam 
> depends on dill==0.3.1.1. `python ./setup.py nosetests --tests 
> 'apache_beam/runners/dataflow/dataflow_runner_test.py:DataflowRunnerTest.test_remote_runner_display_data'`
>  fails currently if run on master.
> The failure indicates infinite recursion during pickling:
> {noformat}
> test_remote_runner_display_data 
> (apache_beam.runners.dataflow.dataflow_runner_test.DataflowRunnerTest) ... 
> Fatal Python error: Cannot recover from stack overflow.
> Current thread 0x7f9d700ed740 (most recent call first):
>   File "/usr/lib/python3.7/pickle.py", line 479 in get
>   File "/usr/lib/python3.7/pickle.py", line 497 in save
>   File "/usr/lib/python3.7/pickle.py", line 786 in save_tuple
>   File "/usr/lib/python3.7/pickle.py", line 504 in save
>   File "/usr/lib/python3.7/pickle.py", line 638 in save_reduce
>   File 
> "/usr/local/google/home/valentyn/tmp/py37env/lib/python3.7/site-packages/dill/_dill.py",
>  line 1394 in save_function
>   File "/usr/lib/python3.7/pickle.py", line 504 in save
>   File "/usr/lib/python3.7/pickle.py", line 882 in _batch_setitems
>   File "/usr/lib/python3.7/pickle.py", line 856 in save_dict
>   File 
> "/usr/local/google/home/valentyn/tmp/py37env/lib/python3.7/site-packages/dill/_dill.py",
>  line 910 in save_module_dict
>   File 
> "/usr/local/google/home/valentyn/projects/beam/clean/beam/sdks/python/apache_beam/internal/pickler.py",
>  line 198 in new_save_module_dict
>   File "/usr/lib/python3.7/pickle.py", line 504 in save
>   File "/usr/lib/python3.7/pickle.py", line 786 in save_tuple
>   File "/usr/lib/python3.7/pickle.py", line 504 in save
>   File "/usr/lib/python3.7/pickle.py", line 638 in save_reduce
>   File 
> "/usr/local/google/home/valentyn/projects/beam/clean/beam/sdks/python/apache_beam/internal/pickler.py",
>  line 114 in wrapper
>   File "/usr/lib/python3.7/pickle.py", line 504 in save
>   File "/usr/lib/python3.7/pickle.py", line 771 in save_tuple
>   File "/usr/lib/python3.7/pickle.py", line 504 in save
>   File "/usr/lib/python3.7/pickle.py", line 638 in save_reduce
>   File 
> "/usr/local/google/home/valentyn/tmp/py37env/lib/python3.7/site-packages/dill/_dill.py",
>  line 1137 in save_cell
>   File "/usr/lib/python3.7/pickle.py", line 504 in save
>   File "/usr/lib/python3.7/pickle.py", line 771 in save_tuple
>   File "/usr/lib/python3.7/pickle.py", line 504 in save
>   File "/usr/lib/python3.7/pickle.py", line 786 in save_tuple
>   File "/usr/lib/python3.7/pickle.py", line 504 in save
>   File "/usr/lib/python3.7/pickle.py", line 638 in save_reduce
>   File 
> "/usr/local/google/home/valentyn/tmp/py37env/lib/python3.7/site-packages/dill/_dill.py",
>  line 1394 in save_function
>   File "/usr/lib/python3.7/pickle.py", line 504 in save
>   File "/usr/lib/python3.7/pickle.py", line 882 in _batch_setitems
>   File "/usr/lib/python3.7/pickle.py", line 856 in save_dict
>   File 
> "/usr/local/google/home/valentyn/tmp/py37env/lib/python3.7/site-packages/dill/_dill.py",
>  line 910 in save_module_dict
>   File 
> 

[jira] [Work logged] (BEAM-7926) Visualize PCollection with Interactive Beam

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7926?focusedWorklogId=332841&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332841
 ]

ASF GitHub Bot logged work on BEAM-7926:


Author: ASF GitHub Bot
Created on: 23/Oct/19 20:11
Start Date: 23/Oct/19 20:11
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on pull request #9741: [BEAM-7926] 
Visualize PCollection
URL: https://github.com/apache/beam/pull/9741#discussion_r338254102
 
 

 ##
 File path: 
sdks/python/apache_beam/runners/interactive/display/pcoll_visualization_test.py
 ##
 @@ -0,0 +1,152 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Tests for apache_beam.runners.interactive.display.pcoll_visualization."""
+from __future__ import absolute_import
+
+import sys
+import time
+import unittest
+
+import apache_beam as beam  # pylint: disable=ungrouped-imports
+import timeloop
+from apache_beam.runners import runner
+from apache_beam.runners.interactive import interactive_environment as ie
+from apache_beam.runners.interactive.display import pcoll_visualization as pv
+
+# Work around nose tests using Python2 without unittest.mock module.
+try:
+  from unittest.mock import patch
+except ImportError:
+  from mock import patch
+
+
+class PCollVisualizationTest(unittest.TestCase):
+
+  def setUp(self):
+self._p = beam.Pipeline()
+# pylint: disable=range-builtin-not-iterating
+self._pcoll = self._p | 'Create' >> beam.Create(range(1000))
+
+  @unittest.skipIf(sys.version_info < (3, 5, 3),
+   'PCollVisualization is not supported on Python 2.')
 
 Review comment:
   Changing the error message.
 



Issue Time Tracking
---

Worklog Id: (was: 332841)
Time Spent: 13h 20m  (was: 13h 10m)

> Visualize PCollection with Interactive Beam
> ---
>
> Key: BEAM-7926
> URL: https://issues.apache.org/jira/browse/BEAM-7926
> Project: Beam
>  Issue Type: New Feature
>  Components: runner-py-interactive
>Reporter: Ning Kang
>Assignee: Ning Kang
>Priority: Major
>  Time Spent: 13h 20m
>  Remaining Estimate: 0h
>
> Support auto plotting / charting of materialized data of a given PCollection 
> with Interactive Beam.
> Say an Interactive Beam pipeline is defined as
> p = create_pipeline()
> pcoll = p | 'Transform' >> transform()
> The user can call a single function and get auto-magical charting of the data 
> as materialized pcoll.
> e.g., visualize(pcoll)
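The asynchronous contract described above can be sketched as a background polling loop. This is an illustrative sketch only: `get_state`, `render`, and the thread-based loop are assumptions made here, not Interactive Beam's real implementation (the PR itself uses the timeloop library).

```python
import threading
import time

END_STATES = {"DONE", "CANCELLED", "FAILED"}

def visualize(pcoll, get_state, render, interval=0.01):
    """Re-render pcoll every `interval` seconds until the pipeline ends."""
    def loop():
        while True:
            render(pcoll)                  # re-draw the materialized data
            if get_state() in END_STATES:  # stop once the pipeline finishes
                break
            time.sleep(interval)
    t = threading.Thread(target=loop, daemon=True)
    t.start()
    return t  # returns immediately; plotting continues asynchronously

# Simulate a pipeline that is RUNNING for two polls and then DONE.
states = iter(["RUNNING", "RUNNING", "DONE"])
frames = []
t = visualize("pcoll", lambda: next(states), frames.append)
t.join(timeout=2)
```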





[jira] [Work logged] (BEAM-7926) Visualize PCollection with Interactive Beam

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7926?focusedWorklogId=332840&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332840
 ]

ASF GitHub Bot logged work on BEAM-7926:


Author: ASF GitHub Bot
Created on: 23/Oct/19 20:10
Start Date: 23/Oct/19 20:10
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on pull request #9741: [BEAM-7926] 
Visualize PCollection
URL: https://github.com/apache/beam/pull/9741#discussion_r338253651
 
 

 ##
 File path: 
sdks/python/apache_beam/runners/interactive/display/pcoll_visualization.py
 ##
 @@ -0,0 +1,279 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Module visualizes PCollection data.
+
+For internal use only; no backwards-compatibility guarantees.
+Only works with Python 3.5+.
+"""
+from __future__ import absolute_import
+
+import base64
+import logging
+from datetime import timedelta
+
+from pandas.io.json import json_normalize
+
+from apache_beam import pvalue
+from apache_beam.runners.interactive import interactive_environment as ie
+from apache_beam.runners.interactive import pipeline_instrument as instr
+
+# jsons doesn't support < Python 3.5. Work around with json for legacy tests.
+# TODO(BEAM-8288): clean up once Py2 is deprecated from Beam.
+try:
+  import jsons
+  _pv_jsons_load = jsons.load
+  _pv_jsons_dump = jsons.dump
+except ImportError:
+  import json
+  _pv_jsons_load = json.load
+  _pv_jsons_dump = json.dump
+
+try:
+  from facets_overview.generic_feature_statistics_generator import GenericFeatureStatisticsGenerator
+  _facets_gfsg_ready = True
+except ImportError:
+  _facets_gfsg_ready = False
+
+try:
+  from IPython.core.display import HTML
+  from IPython.core.display import Javascript
+  from IPython.core.display import display
+  from IPython.core.display import display_javascript
+  from IPython.core.display import update_display
+  _ipython_ready = True
+except ImportError:
+  _ipython_ready = False
+
+try:
+  from timeloop import Timeloop
+  _tl_ready = True
+except ImportError:
+  _tl_ready = False
+
+# 1-d types that need additional normalization to be compatible with DataFrame.
+_one_dimension_types = (int, float, str, bool, list, tuple)
+
+_DIVE_SCRIPT_TEMPLATE = """
+document.querySelector("#{display_id}").data = {jsonstr};"""
+_DIVE_HTML_TEMPLATE = """
+<script src="https://cdnjs.cloudflare.com/ajax/libs/webcomponentsjs/1.3.3/webcomponents-lite.js"></script>
+<link rel="import" href="https://raw.githubusercontent.com/PAIR-code/facets/1.0.0/facets-dist/facets-jupyter.html">
+<facets-dive id="{display_id}" height="600"></facets-dive>
+<script>
+  document.querySelector("#{display_id}").data = {jsonstr};
+</script>"""
+_OVERVIEW_SCRIPT_TEMPLATE = """
+  document.querySelector("#{display_id}").protoInput = "{protostr}";
+  """
+_OVERVIEW_HTML_TEMPLATE = """
+<script src="https://cdnjs.cloudflare.com/ajax/libs/webcomponentsjs/1.3.3/webcomponents-lite.js"></script>
+<link rel="import" href="https://raw.githubusercontent.com/PAIR-code/facets/1.0.0/facets-dist/facets-jupyter.html">
+<facets-overview id="{display_id}"></facets-overview>
+<script>
+  document.querySelector("#{display_id}").protoInput = "{protostr}";
+</script>"""
+_DATAFRAME_PAGINATION_TEMPLATE = """
+<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.2.2/jquery.min.js"></script>
+<script src="https://cdn.datatables.net/1.10.16/js/jquery.dataTables.js"></script>
+<link rel="stylesheet" href="https://cdn.datatables.net/1.10.16/css/jquery.dataTables.css">
+{dataframe_html}
+<script>
+  $("#{table_id}").DataTable();
+</script>"""
+
+
+def visualize(pcoll, dynamic_plotting_interval=None):
+  """Visualizes the data of a given PCollection. Optionally enables dynamic
+  plotting with interval in seconds if the PCollection is being produced by a
+  running pipeline or the pipeline is streaming indefinitely. The function
+  always returns immediately and is asynchronous when dynamic plotting is on.
+
+  If dynamic plotting is enabled, the visualization is updated continuously until
+  the pipeline producing the PCollection is in an end state. The visualization
+  would be anchored to the notebook cell output 

[jira] [Work logged] (BEAM-8432) Parametrize source & target compatibility for beam Java modules

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8432?focusedWorklogId=332838&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332838
 ]

ASF GitHub Bot logged work on BEAM-8432:


Author: ASF GitHub Bot
Created on: 23/Oct/19 20:09
Start Date: 23/Oct/19 20:09
Worklog Time Spent: 10m 
  Work Description: lgajowy commented on issue #9830: [BEAM-8432] Move 
javaVersion to gradle.properties
URL: https://github.com/apache/beam/pull/9830#issuecomment-545614858
 
 
   Run Python PreCommit
 



Issue Time Tracking
---

Worklog Id: (was: 332838)
Time Spent: 20m  (was: 10m)

> Parametrize source & target compatibility for beam Java modules
> ---
>
> Key: BEAM-8432
> URL: https://issues.apache.org/jira/browse/BEAM-8432
> Project: Beam
>  Issue Type: Improvement
>  Components: build-system
>Reporter: Lukasz Gajowy
>Assignee: Lukasz Gajowy
>Priority: Major
> Fix For: Not applicable
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Currently, the "javaVersion" property is hardcoded in BeamModulePlugin in 
> [JavaNatureConfiguration|https://github.com/apache/beam/blob/master/buildSrc/src/main/groovy/org/apache/beam/gradle/BeamModulePlugin.groovy#L82].
> For the sake of migrating the project to Java 11, we could use a mechanism 
> that will allow parametrizing the version from the command line, e.g.:
> {code:java}
> // this could set source and target compatibility to 11:
> ./gradlew clean build -PjavaVersion=11{code}
>  





[jira] [Work logged] (BEAM-8432) Parametrize source & target compatibility for beam Java modules

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8432?focusedWorklogId=332839&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332839
 ]

ASF GitHub Bot logged work on BEAM-8432:


Author: ASF GitHub Bot
Created on: 23/Oct/19 20:09
Start Date: 23/Oct/19 20:09
Worklog Time Spent: 10m 
  Work Description: lgajowy commented on issue #9830: [BEAM-8432] Move 
javaVersion to gradle.properties
URL: https://github.com/apache/beam/pull/9830#issuecomment-545614924
 
 
   Run CommunityMetrics PreCommit
 



Issue Time Tracking
---

Worklog Id: (was: 332839)
Time Spent: 0.5h  (was: 20m)

> Parametrize source & target compatibility for beam Java modules
> ---
>
> Key: BEAM-8432
> URL: https://issues.apache.org/jira/browse/BEAM-8432
> Project: Beam
>  Issue Type: Improvement
>  Components: build-system
>Reporter: Lukasz Gajowy
>Assignee: Lukasz Gajowy
>Priority: Major
> Fix For: Not applicable
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Currently, the "javaVersion" property is hardcoded in BeamModulePlugin in 
> [JavaNatureConfiguration|https://github.com/apache/beam/blob/master/buildSrc/src/main/groovy/org/apache/beam/gradle/BeamModulePlugin.groovy#L82].
> For the sake of migrating the project to Java 11, we could use a mechanism 
> that will allow parametrizing the version from the command line, e.g.:
> {code:java}
> // this could set source and target compatibility to 11:
> ./gradlew clean build -PjavaVersion=11{code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-8446) apache_beam.io.gcp.bigquery_write_it_test.BigQueryWriteIntegrationTests.test_big_query_write_new_types is flaky

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8446?focusedWorklogId=332834&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332834
 ]

ASF GitHub Bot logged work on BEAM-8446:


Author: ASF GitHub Bot
Created on: 23/Oct/19 19:59
Start Date: 23/Oct/19 19:59
Worklog Time Spent: 10m 
  Work Description: pabloem commented on issue #9855: [BEAM-8446] Retrying 
BQ query on timeouts
URL: https://github.com/apache/beam/pull/9855#issuecomment-545611313
 
 
   Run Python PreCommit
 



Issue Time Tracking
---

Worklog Id: (was: 332834)
Time Spent: 2h 40m  (was: 2.5h)

> apache_beam.io.gcp.bigquery_write_it_test.BigQueryWriteIntegrationTests.test_big_query_write_new_types
>  is flaky
> ---
>
> Key: BEAM-8446
> URL: https://issues.apache.org/jira/browse/BEAM-8446
> Project: Beam
>  Issue Type: New Feature
>  Components: test-failures
>Reporter: Boyuan Zhang
>Assignee: Pablo Estrada
>Priority: Major
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> test_big_query_write_new_types appears to be flaky in 
> beam_PostCommit_Python37 test suite.
> https://builds.apache.org/job/beam_PostCommit_Python37/733/
> https://builds.apache.org/job/beam_PostCommit_Python37/739/
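The fix direction named in the PR title ("Retrying BQ query on timeouts") is the standard bounded-retry-with-backoff pattern. A generic sketch under assumed names (this is plain Python, not Beam's actual retry utilities):

```python
import time

def retry_on_timeout(fn, attempts=3, base_delay=0.0):
    """Call fn(), retrying up to `attempts` times on TimeoutError."""
    for i in range(attempts):
        try:
            return fn()
        except TimeoutError:
            if i == attempts - 1:
                raise                        # retries exhausted: surface error
            time.sleep(base_delay * 2 ** i)  # exponential backoff between tries

calls = []
def flaky_query():
    # Hypothetical stand-in for a BigQuery query that times out twice.
    calls.append(1)
    if len(calls) < 3:
        raise TimeoutError("BQ query timed out")
    return "rows"

result = retry_on_timeout(flaky_query)
```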





[jira] [Work logged] (BEAM-8446) apache_beam.io.gcp.bigquery_write_it_test.BigQueryWriteIntegrationTests.test_big_query_write_new_types is flaky

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8446?focusedWorklogId=332833&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332833
 ]

ASF GitHub Bot logged work on BEAM-8446:


Author: ASF GitHub Bot
Created on: 23/Oct/19 19:59
Start Date: 23/Oct/19 19:59
Worklog Time Spent: 10m 
  Work Description: pabloem commented on issue #9855: [BEAM-8446] Retrying 
BQ query on timeouts
URL: https://github.com/apache/beam/pull/9855#issuecomment-545170744
 
 
   First instance of job running:
   https://builds.apache.org/job/beam_PostCommit_Python37_PR/42/
   Next:
   https://builds.apache.org/job/beam_PostCommit_Python37_PR/43/
   Next:
   https://builds.apache.org/job/beam_PostCommit_Python37_PR/44/ - timeout 
without errors
   Next:
   https://builds.apache.org/job/beam_PostCommit_Python37_PR/45/ - timeout 
without errors
 



Issue Time Tracking
---

Worklog Id: (was: 332833)
Time Spent: 2.5h  (was: 2h 20m)

> apache_beam.io.gcp.bigquery_write_it_test.BigQueryWriteIntegrationTests.test_big_query_write_new_types
>  is flaky
> ---
>
> Key: BEAM-8446
> URL: https://issues.apache.org/jira/browse/BEAM-8446
> Project: Beam
>  Issue Type: New Feature
>  Components: test-failures
>Reporter: Boyuan Zhang
>Assignee: Pablo Estrada
>Priority: Major
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> test_big_query_write_new_types appears to be flaky in 
> beam_PostCommit_Python37 test suite.
> https://builds.apache.org/job/beam_PostCommit_Python37/733/
> https://builds.apache.org/job/beam_PostCommit_Python37/739/





[jira] [Work logged] (BEAM-8446) apache_beam.io.gcp.bigquery_write_it_test.BigQueryWriteIntegrationTests.test_big_query_write_new_types is flaky

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8446?focusedWorklogId=332832&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332832
 ]

ASF GitHub Bot logged work on BEAM-8446:


Author: ASF GitHub Bot
Created on: 23/Oct/19 19:58
Start Date: 23/Oct/19 19:58
Worklog Time Spent: 10m 
  Work Description: pabloem commented on issue #9855: [BEAM-8446] Retrying 
BQ query on timeouts
URL: https://github.com/apache/beam/pull/9855#issuecomment-545611045
 
 
   Run Python 3.7 PostCommit
   
   
 



Issue Time Tracking
---

Worklog Id: (was: 332832)
Time Spent: 2h 20m  (was: 2h 10m)

> apache_beam.io.gcp.bigquery_write_it_test.BigQueryWriteIntegrationTests.test_big_query_write_new_types
>  is flaky
> ---
>
> Key: BEAM-8446
> URL: https://issues.apache.org/jira/browse/BEAM-8446
> Project: Beam
>  Issue Type: New Feature
>  Components: test-failures
>Reporter: Boyuan Zhang
>Assignee: Pablo Estrada
>Priority: Major
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> test_big_query_write_new_types appears to be flaky in 
> beam_PostCommit_Python37 test suite.
> https://builds.apache.org/job/beam_PostCommit_Python37/733/
> https://builds.apache.org/job/beam_PostCommit_Python37/739/





[jira] [Work logged] (BEAM-8446) apache_beam.io.gcp.bigquery_write_it_test.BigQueryWriteIntegrationTests.test_big_query_write_new_types is flaky

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8446?focusedWorklogId=332831&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332831
 ]

ASF GitHub Bot logged work on BEAM-8446:


Author: ASF GitHub Bot
Created on: 23/Oct/19 19:58
Start Date: 23/Oct/19 19:58
Worklog Time Spent: 10m 
  Work Description: pabloem commented on issue #9855: [BEAM-8446] Retrying 
BQ query on timeouts
URL: https://github.com/apache/beam/pull/9855#issuecomment-545611021
 
 
   Again timeout without errors.
 



Issue Time Tracking
---

Worklog Id: (was: 332831)
Time Spent: 2h 10m  (was: 2h)

> apache_beam.io.gcp.bigquery_write_it_test.BigQueryWriteIntegrationTests.test_big_query_write_new_types
>  is flaky
> ---
>
> Key: BEAM-8446
> URL: https://issues.apache.org/jira/browse/BEAM-8446
> Project: Beam
>  Issue Type: New Feature
>  Components: test-failures
>Reporter: Boyuan Zhang
>Assignee: Pablo Estrada
>Priority: Major
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> test_big_query_write_new_types appears to be flaky in 
> beam_PostCommit_Python37 test suite.
> https://builds.apache.org/job/beam_PostCommit_Python37/733/
> https://builds.apache.org/job/beam_PostCommit_Python37/739/





[jira] [Work logged] (BEAM-7926) Visualize PCollection with Interactive Beam

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7926?focusedWorklogId=332830&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332830
 ]

ASF GitHub Bot logged work on BEAM-7926:


Author: ASF GitHub Bot
Created on: 23/Oct/19 19:46
Start Date: 23/Oct/19 19:46
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on pull request #9741: [BEAM-7926] 
Visualize PCollection
URL: https://github.com/apache/beam/pull/9741#discussion_r338243645
 
 

 ##
 File path: sdks/python/setup.py
 ##
 @@ -166,6 +166,14 @@ def get_version():
 'google-cloud-bigtable>=0.31.1,<1.1.0',
 ]
 
+INTERACTIVE_BEAM = [
+'facets-overview>=1.0.0,<2',
 
 Review comment:
   They are already sorted. Do you want me to also sort GCP_REQUIREMENTS?
 



Issue Time Tracking
---

Worklog Id: (was: 332830)
Time Spent: 13h  (was: 12h 50m)

> Visualize PCollection with Interactive Beam
> ---
>
> Key: BEAM-7926
> URL: https://issues.apache.org/jira/browse/BEAM-7926
> Project: Beam
>  Issue Type: New Feature
>  Components: runner-py-interactive
>Reporter: Ning Kang
>Assignee: Ning Kang
>Priority: Major
>  Time Spent: 13h
>  Remaining Estimate: 0h
>
> Support auto plotting / charting of materialized data of a given PCollection 
> with Interactive Beam.
> Say an Interactive Beam pipeline is defined as
> p = create_pipeline()
> pcoll = p | 'Transform' >> transform()
> The user can call a single function and get auto-magical charting of the data 
> as materialized pcoll.
> e.g., visualize(pcoll)





[jira] [Commented] (BEAM-8397) DataflowRunnerTest.test_remote_runner_display_data fails due to infinite recursion during pickling.

2019-10-23 Thread Valentyn Tymofieiev (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-8397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16958195#comment-16958195
 ] 

Valentyn Tymofieiev commented on BEAM-8397:
---

For some reason, extracting the inner classes SpecialDoFn and SpecialParDo 
out of test_remote_runner_display_data makes the test fail, because 1 out of 3 
expected display_data entries does not appear:

{noformat}
Traceback (most recent call last):
 File 
"/usr/local/google/home/valentyn/projects/beam/beam/beam/sdks/python/apache_beam/runners/dataflow/dataflow_runner_test.py",
 line 290, in test_remote_runner_display_data
 self.assertEqual(len(disp_data), 3)
AssertionError: 2 != 3
{noformat}

Examination of the output shows that the TIMESTAMP display data entry is 
missing, while the INT and STRING ones are present. Sounds like another issue...
 

 

> DataflowRunnerTest.test_remote_runner_display_data fails due to infinite 
> recursion during pickling.
> ---
>
> Key: BEAM-8397
> URL: https://issues.apache.org/jira/browse/BEAM-8397
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-py-core
>Reporter: Valentyn Tymofieiev
>Assignee: Valentyn Tymofieiev
>Priority: Major
>
> `python ./setup.py test -s 
> apache_beam.runners.dataflow.dataflow_runner_test.DataflowRunnerTest.test_remote_runner_display_data`
>  passes.
> `tox -e py37-gcp` passes if Beam depends on dill==0.3.0, but fails if Beam 
> depends on dill==0.3.1.1.`python ./setup.py nosetests --tests 
> 'apache_beam/runners/dataflow/dataflow_runner_test.py:DataflowRunnerTest.test_remote_runner_display_data`
>  fails currently if run on master.
> The failure indicates infinite recursion during pickling:
> {noformat}
> test_remote_runner_display_data 
> (apache_beam.runners.dataflow.dataflow_runner_test.DataflowRunnerTest) ... 
> Fatal Python error: Cannot recover from stack overflow.
> Current thread 0x7f9d700ed740 (most recent call first):
>   File "/usr/lib/python3.7/pickle.py", line 479 in get
>   File "/usr/lib/python3.7/pickle.py", line 497 in save
>   File "/usr/lib/python3.7/pickle.py", line 786 in save_tuple
>   File "/usr/lib/python3.7/pickle.py", line 504 in save
>   File "/usr/lib/python3.7/pickle.py", line 638 in save_reduce
>   File 
> "/usr/local/google/home/valentyn/tmp/py37env/lib/python3.7/site-packages/dill/_dill.py",
>  line 1394 in save_function
>   File "/usr/lib/python3.7/pickle.py", line 504 in save
>   File "/usr/lib/python3.7/pickle.py", line 882 in _batch_setitems
>   File "/usr/lib/python3.7/pickle.py", line 856 in save_dict
>   File 
> "/usr/local/google/home/valentyn/tmp/py37env/lib/python3.7/site-packages/dill/_dill.py",
>  line 910 in save_module_dict
>   File 
> "/usr/local/google/home/valentyn/projects/beam/clean/beam/sdks/python/apache_beam/internal/pickler.py",
>  line 198 in new_save_module_dict
>   File "/usr/lib/python3.7/pickle.py", line 504 in save
>   File "/usr/lib/python3.7/pickle.py", line 786 in save_tuple
>   File "/usr/lib/python3.7/pickle.py", line 504 in save
>   File "/usr/lib/python3.7/pickle.py", line 638 in save_reduce
>   File 
> "/usr/local/google/home/valentyn/projects/beam/clean/beam/sdks/python/apache_beam/internal/pickler.py",
>  line 114 in wrapper
>   File "/usr/lib/python3.7/pickle.py", line 504 in save
>   File "/usr/lib/python3.7/pickle.py", line 771 in save_tuple
>   File "/usr/lib/python3.7/pickle.py", line 504 in save
>   File "/usr/lib/python3.7/pickle.py", line 638 in save_reduce
>   File 
> "/usr/local/google/home/valentyn/tmp/py37env/lib/python3.7/site-packages/dill/_dill.py",
>  line 1137 in save_cell
>   File "/usr/lib/python3.7/pickle.py", line 504 in save
>   File "/usr/lib/python3.7/pickle.py", line 771 in save_tuple
>   File "/usr/lib/python3.7/pickle.py", line 504 in save
>   File "/usr/lib/python3.7/pickle.py", line 786 in save_tuple
>   File "/usr/lib/python3.7/pickle.py", line 504 in save
>   File "/usr/lib/python3.7/pickle.py", line 638 in save_reduce
>   File 
> "/usr/local/google/home/valentyn/tmp/py37env/lib/python3.7/site-packages/dill/_dill.py",
>  line 1394 in save_function
>   File "/usr/lib/python3.7/pickle.py", line 504 in save
>   File "/usr/lib/python3.7/pickle.py", line 882 in _batch_setitems
>   File "/usr/lib/python3.7/pickle.py", line 856 in save_dict
>   File 
> "/usr/local/google/home/valentyn/tmp/py37env/lib/python3.7/site-packages/dill/_dill.py",
>  line 910 in save_module_dict
>   File 
> "/usr/local/google/home/valentyn/projects/beam/clean/beam/sdks/python/apache_beam/internal/pickler.py",
>  line 198 in new_save_module_dict
> ...
> {noformat}
> cc: [~yoshiki.obata]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
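The "Cannot recover from stack overflow" failure above comes from a pickler recursing once per level of the object graph. The effect can be illustrated with the stdlib pickler alone; this is a hedged sketch of the general mechanism, not the actual dill 0.3.1.1 code path that the issue describes:

```python
import pickle


def deeply_nested(depth):
    """Build a list nested `depth` levels deep: [[[...]]]."""
    obj = []
    for _ in range(depth):
        obj = [obj]
    return obj


# Each nesting level costs one recursive save() call inside the pickler,
# so a structure far deeper than the interpreter's recursion limit makes
# pickling exhaust the stack instead of completing.
try:
    pickle.dumps(deeply_nested(100_000))
    overflowed = False
except RecursionError:
    overflowed = True
```

In the Jira report the depth comes not from nesting data but from a cycle between `save_function`, `save_module_dict`, and the wrapped `new_save_module_dict`, which keeps re-entering the same frames until the C stack overflows.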


[jira] [Comment Edited] (BEAM-8397) DataflowRunnerTest.test_remote_runner_display_data fails due to infinite recursion during pickling.

2019-10-23 Thread Valentyn Tymofieiev (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-8397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16958195#comment-16958195
 ] 

Valentyn Tymofieiev edited comment on BEAM-8397 at 10/23/19 7:44 PM:
-

For some reason, extracting the inner subclasses SpecialDoFn and SpecialParDo 
out of test_remote_runner_display_data makes the test fail, because 1 out of 3 
expected display_data entries does not appear:

{noformat}
Traceback (most recent call last):
 File 
"/usr/local/google/home/valentyn/projects/beam/beam/beam/sdks/python/apache_beam/runners/dataflow/dataflow_runner_test.py",
 line 290, in test_remote_runner_display_data
 self.assertEqual(len(disp_data), 3)
AssertionError: 2 != 3
{noformat}

Examination of the output shows that the TIMESTAMP display data entry is 
missing, while the INT and STRING ones are present. Sounds like another issue...
 

 


was (Author: tvalentyn):
For some reason, extracting the inner subclasses SpecialDoFn and SpecialParDo 
out of test_remote_runner_display_data makes the test fail, because 1 out of 3 
expected display_data entries does not appear:

{noformat}
Traceback (most recent call last):
 File 
"/usr/local/google/home/valentyn/projects/beam/beam/beam/sdks/python/apache_beam/runners/dataflow/dataflow_runner_test.py",
 line 290, in test_remote_runner_display_data
 self.assertEqual(len(disp_data), 3)
AssertionError: 2 != 3
{noformat}

Examination of the output shows that the TIMESTAMP display data entry is 
missing, while the INT and STRING ones are present. Sounds like another issue...
 

 

> DataflowRunnerTest.test_remote_runner_display_data fails due to infinite 
> recursion during pickling.
> ---
>
> Key: BEAM-8397
> URL: https://issues.apache.org/jira/browse/BEAM-8397
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-py-core
>Reporter: Valentyn Tymofieiev
>Assignee: Valentyn Tymofieiev
>Priority: Major
>
> `python ./setup.py test -s 
> apache_beam.runners.dataflow.dataflow_runner_test.DataflowRunnerTest.test_remote_runner_display_data`
>  passes.
> `tox -e py37-gcp` passes if Beam depends on dill==0.3.0, but fails if Beam 
> depends on dill==0.3.1.1.`python ./setup.py nosetests --tests 
> 'apache_beam/runners/dataflow/dataflow_runner_test.py:DataflowRunnerTest.test_remote_runner_display_data`
>  fails currently if run on master.
> The failure indicates infinite recursion during pickling:
> {noformat}
> test_remote_runner_display_data 
> (apache_beam.runners.dataflow.dataflow_runner_test.DataflowRunnerTest) ... 
> Fatal Python error: Cannot recover from stack overflow.
> Current thread 0x7f9d700ed740 (most recent call first):
>   File "/usr/lib/python3.7/pickle.py", line 479 in get
>   File "/usr/lib/python3.7/pickle.py", line 497 in save
>   File "/usr/lib/python3.7/pickle.py", line 786 in save_tuple
>   File "/usr/lib/python3.7/pickle.py", line 504 in save
>   File "/usr/lib/python3.7/pickle.py", line 638 in save_reduce
>   File 
> "/usr/local/google/home/valentyn/tmp/py37env/lib/python3.7/site-packages/dill/_dill.py",
>  line 1394 in save_function
>   File "/usr/lib/python3.7/pickle.py", line 504 in save
>   File "/usr/lib/python3.7/pickle.py", line 882 in _batch_setitems
>   File "/usr/lib/python3.7/pickle.py", line 856 in save_dict
>   File 
> "/usr/local/google/home/valentyn/tmp/py37env/lib/python3.7/site-packages/dill/_dill.py",
>  line 910 in save_module_dict
>   File 
> "/usr/local/google/home/valentyn/projects/beam/clean/beam/sdks/python/apache_beam/internal/pickler.py",
>  line 198 in new_save_module_dict
>   File "/usr/lib/python3.7/pickle.py", line 504 in save
>   File "/usr/lib/python3.7/pickle.py", line 786 in save_tuple
>   File "/usr/lib/python3.7/pickle.py", line 504 in save
>   File "/usr/lib/python3.7/pickle.py", line 638 in save_reduce
>   File 
> "/usr/local/google/home/valentyn/projects/beam/clean/beam/sdks/python/apache_beam/internal/pickler.py",
>  line 114 in wrapper
>   File "/usr/lib/python3.7/pickle.py", line 504 in save
>   File "/usr/lib/python3.7/pickle.py", line 771 in save_tuple
>   File "/usr/lib/python3.7/pickle.py", line 504 in save
>   File "/usr/lib/python3.7/pickle.py", line 638 in save_reduce
>   File 
> "/usr/local/google/home/valentyn/tmp/py37env/lib/python3.7/site-packages/dill/_dill.py",
>  line 1137 in save_cell
>   File "/usr/lib/python3.7/pickle.py", line 504 in save
>   File "/usr/lib/python3.7/pickle.py", line 771 in save_tuple
>   File "/usr/lib/python3.7/pickle.py", line 504 in save
>   File "/usr/lib/python3.7/pickle.py", line 786 in save_tuple
>   File "/usr/lib/python3.7/pickle.py", line 504 in save
>   File "/usr/lib/python3.7/pickle.py", line 638 in save_reduce
>   File 
> 

[jira] [Work logged] (BEAM-7926) Visualize PCollection with Interactive Beam

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7926?focusedWorklogId=332826&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332826
 ]

ASF GitHub Bot logged work on BEAM-7926:


Author: ASF GitHub Bot
Created on: 23/Oct/19 19:37
Start Date: 23/Oct/19 19:37
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on pull request #9741: [BEAM-7926] 
Visualize PCollection
URL: https://github.com/apache/beam/pull/9741#discussion_r338240149
 
 

 ##
 File path: 
sdks/python/apache_beam/runners/interactive/display/pcoll_visualization_test.py
 ##
 @@ -0,0 +1,152 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Tests for apache_beam.runners.interactive.display.pcoll_visualization."""
+from __future__ import absolute_import
+
+import sys
+import time
+import unittest
+
+import apache_beam as beam  # pylint: disable=ungrouped-imports
+import timeloop
+from apache_beam.runners import runner
+from apache_beam.runners.interactive import interactive_environment as ie
+from apache_beam.runners.interactive.display import pcoll_visualization as pv
+
+# Work around nose tests using Python2 without unittest.mock module.
+try:
+  from unittest.mock import patch
+except ImportError:
+  from mock import patch
+
+
+class PCollVisualizationTest(unittest.TestCase):
+
+  def setUp(self):
+self._p = beam.Pipeline()
+# pylint: disable=range-builtin-not-iterating
+self._pcoll = self._p | 'Create' >> beam.Create(range(1000))
+
+  @unittest.skipIf(sys.version_info < (3, 5, 3),
+   'PCollVisualization is not supported on Python 2.')
+  def test_raise_error_for_non_pcoll_input(self):
+class Foo(object):
+  pass
+
+with self.assertRaises(AssertionError) as ctx:
+  pv.PCollVisualization(Foo())
+  self.assertTrue('pcoll should be apache_beam.pvalue.PCollection' in
+  ctx.exception)
+
+  @unittest.skipIf(sys.version_info < (3, 5, 3),
+   'PCollVisualization is not supported on Python 2.')
+  def test_pcoll_visualization_generate_unique_display_id(self):
+pv_1 = pv.PCollVisualization(self._pcoll)
+pv_2 = pv.PCollVisualization(self._pcoll)
+self.assertNotEqual(pv_1._dive_display_id, pv_2._dive_display_id)
+self.assertNotEqual(pv_1._overview_display_id, pv_2._overview_display_id)
+self.assertNotEqual(pv_1._df_display_id, pv_2._df_display_id)
+
+  @unittest.skipIf(sys.version_info < (3, 5, 3),
+   'PCollVisualization is not supported on Python 2.')
+  @patch('apache_beam.runners.interactive.display.pcoll_visualization'
+ '.PCollVisualization._to_element_list', lambda x: [1, 2, 3])
+  def test_one_shot_visualization_not_return_handle(self):
+self.assertIsNone(pv.visualize(self._pcoll))
+
+  def _mock_to_element_list(self):
+yield [1, 2, 3]
+yield [1, 2, 3, 4]
+yield [1, 2, 3, 4, 5]
+yield [1, 2, 3, 4, 5, 6]
+yield [1, 2, 3, 4, 5, 6, 7]
+yield [1, 2, 3, 4, 5, 6, 7, 8]
+
+  @unittest.skipIf(sys.version_info < (3, 5, 3),
+   'PCollVisualization is not supported on Python 2.')
+  @patch('apache_beam.runners.interactive.display.pcoll_visualization'
+ '.PCollVisualization._to_element_list', _mock_to_element_list)
+  def test_dynamic_plotting_return_handle(self):
+h = pv.visualize(self._pcoll, dynamic_plotting_interval=1)
+self.assertIsInstance(h, timeloop.Timeloop)
+h.stop()
+
+  @unittest.skipIf(sys.version_info < (3, 5, 3),
+   'PCollVisualization is not supported on Python 2.')
+  @patch('apache_beam.runners.interactive.display.pcoll_visualization'
+ '.PCollVisualization._to_element_list', _mock_to_element_list)
+  @patch('apache_beam.runners.interactive.display.pcoll_visualization'
+ '.PCollVisualization.display_facets')
+  def test_dynamic_plotting_update_same_display(self,
+mocked_display_facets):
+fake_pipeline_result = runner.PipelineResult(runner.PipelineState.RUNNING)
+ie.current_env().set_pipeline_result(self._p, fake_pipeline_result)
+# Starts async dynamic plotting that never 

[jira] [Work logged] (BEAM-7926) Visualize PCollection with Interactive Beam

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7926?focusedWorklogId=332802&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332802
 ]

ASF GitHub Bot logged work on BEAM-7926:


Author: ASF GitHub Bot
Created on: 23/Oct/19 18:57
Start Date: 23/Oct/19 18:57
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on pull request #9741: [BEAM-7926] 
Visualize PCollection
URL: https://github.com/apache/beam/pull/9741#discussion_r338223025
 
 

 ##
 File path: 
sdks/python/apache_beam/runners/interactive/display/pcoll_visualization.py
 ##
 @@ -0,0 +1,279 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Module visualizes PCollection data.
+
+For internal use only; no backwards-compatibility guarantees.
+Only works with Python 3.5+.
+"""
+from __future__ import absolute_import
+
+import base64
+import logging
+from datetime import timedelta
+
+from pandas.io.json import json_normalize
+
+from apache_beam import pvalue
+from apache_beam.runners.interactive import interactive_environment as ie
+from apache_beam.runners.interactive import pipeline_instrument as instr
+
+# jsons doesn't support < Python 3.5. Work around with json for legacy tests.
+# TODO(BEAM-8288): clean up once Py2 is deprecated from Beam.
+try:
+  import jsons
+  _pv_jsons_load = jsons.load
+  _pv_jsons_dump = jsons.dump
+except ImportError:
+  import json
+  _pv_jsons_load = json.load
+  _pv_jsons_dump = json.dump
+
+try:
+  from facets_overview.generic_feature_statistics_generator import 
GenericFeatureStatisticsGenerator
+  _facets_gfsg_ready = True
+except ImportError:
+  _facets_gfsg_ready = False
+
+try:
+  from IPython.core.display import HTML
+  from IPython.core.display import Javascript
+  from IPython.core.display import display
+  from IPython.core.display import display_javascript
+  from IPython.core.display import update_display
+  _ipython_ready = True
+except ImportError:
+  _ipython_ready = False
+
+try:
+  from timeloop import Timeloop
+  _tl_ready = True
+except ImportError:
+  _tl_ready = False
+
+# 1-d types that need additional normalization to be compatible with DataFrame.
+_one_dimension_types = (int, float, str, bool, list, tuple)
+
+_DIVE_SCRIPT_TEMPLATE = """
+document.querySelector("#{display_id}").data = {jsonstr};"""
+_DIVE_HTML_TEMPLATE = """
+<script src="https://cdnjs.cloudflare.com/ajax/libs/webcomponentsjs/1.3.3/webcomponents-lite.js"></script>
+<link rel="import" href="https://raw.githubusercontent.com/PAIR-code/facets/1.0.0/facets-dist/facets-jupyter.html">
+<facets-dive id="{display_id}"></facets-dive>
+<script>
+  document.querySelector("#{display_id}").data = {jsonstr};
+</script>"""
+_OVERVIEW_SCRIPT_TEMPLATE = """
+  document.querySelector("#{display_id}").protoInput = "{protostr}";
+  """
+_OVERVIEW_HTML_TEMPLATE = """
+<script src="https://cdnjs.cloudflare.com/ajax/libs/webcomponentsjs/1.3.3/webcomponents-lite.js"></script>
+<link rel="import" href="https://raw.githubusercontent.com/PAIR-code/facets/1.0.0/facets-dist/facets-jupyter.html">
+<facets-overview id="{display_id}"></facets-overview>
+<script>
+  document.querySelector("#{display_id}").protoInput = "{protostr}";
+</script>"""
+_DATAFRAME_PAGINATION_TEMPLATE = """
+<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.2.2/jquery.min.js"></script>
+<script src="https://cdn.datatables.net/1.10.16/js/jquery.dataTables.js"></script>
+<link rel="stylesheet" href="https://cdn.datatables.net/1.10.16/css/jquery.dataTables.css">
+{dataframe_html}
+<script>
+  $("#{table_id}").DataTable();
+</script>"""
+
+
+def visualize(pcoll, dynamic_plotting_interval=None):
+  """Visualizes the data of a given PCollection. Optionally enables dynamic
+  plotting with interval in seconds if the PCollection is being produced by a
+  running pipeline or the pipeline is streaming indefinitely. The function
+  always returns immediately and is asynchronous when dynamic plotting is on.
+
+  If dynamic plotting enabled, the visualization is updated continuously until
+  the pipeline producing the PCollection is in an end state. The visualization
+  would be anchored to the notebook cell output 

[jira] [Work logged] (BEAM-7926) Visualize PCollection with Interactive Beam

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7926?focusedWorklogId=332801&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332801
 ]

ASF GitHub Bot logged work on BEAM-7926:


Author: ASF GitHub Bot
Created on: 23/Oct/19 18:56
Start Date: 23/Oct/19 18:56
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on pull request #9741: [BEAM-7926] 
Visualize PCollection
URL: https://github.com/apache/beam/pull/9741#discussion_r338222750
 
 

 ##
 File path: sdks/python/setup.py
 ##
 @@ -219,11 +227,15 @@ def run(self):
 install_requires=REQUIRED_PACKAGES,
 python_requires=python_requires,
 test_suite='nose.collector',
-tests_require=REQUIRED_TEST_PACKAGES,
+tests_require=[
+REQUIRED_TEST_PACKAGES,
+INTERACTIVE_BEAM,
 
 Review comment:
   Yes, I agree! This could avoid running those tests when the Python version is 
under 3.5.3.
   I guess the only task that would still require [interactive] on older Python 
versions is "docs", where it scans the source code. I'll just add the 
try-import statements everywhere to work around it.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 332801)
Time Spent: 12.5h  (was: 12h 20m)

> Visualize PCollection with Interactive Beam
> ---
>
> Key: BEAM-7926
> URL: https://issues.apache.org/jira/browse/BEAM-7926
> Project: Beam
>  Issue Type: New Feature
>  Components: runner-py-interactive
>Reporter: Ning Kang
>Assignee: Ning Kang
>Priority: Major
>  Time Spent: 12.5h
>  Remaining Estimate: 0h
>
> Support auto plotting / charting of materialized data of a given PCollection 
> with Interactive Beam.
> Say an Interactive Beam pipeline is defined as
> p = create_pipeline()
> pcoll = p | 'Transform' >> transform()
> The user can call a single function and get auto-magical charting of the data 
> as materialized pcoll.
> e.g., visualize(pcoll)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
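The try-import workaround mentioned in the review thread above can be sketched as follows. Note `fancy_json` is a hypothetical optional package used purely for illustration (the real module falls back from `jsons` to the stdlib `json`); the feature-flag name is an assumption, not Beam's API:

```python
import json

# Optional-dependency pattern: prefer the optional package, but fall back
# to the stdlib so that merely importing this module never fails.
try:
    import fancy_json as _json_impl  # hypothetical optional dependency
    _FANCY_READY = True
except ImportError:
    _json_impl = json
    _FANCY_READY = False


def dump_row(row):
    """Serialize a row with whichever JSON backend is available."""
    return _json_impl.dumps(row, sort_keys=True)
```

Code that needs the optional feature checks the flag (or NOOPs) instead of importing the package directly, which is what keeps tasks like "docs" working on interpreters where the extra is not installed.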


[jira] [Work logged] (BEAM-7926) Visualize PCollection with Interactive Beam

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7926?focusedWorklogId=332798&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332798
 ]

ASF GitHub Bot logged work on BEAM-7926:


Author: ASF GitHub Bot
Created on: 23/Oct/19 18:48
Start Date: 23/Oct/19 18:48
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on pull request #9741: [BEAM-7926] 
Visualize PCollection
URL: https://github.com/apache/beam/pull/9741#discussion_r338218993
 
 

 ##
 File path: 
sdks/python/apache_beam/runners/interactive/interactive_environment_test.py
 ##
 @@ -88,6 +91,72 @@ def test_watch_class_instance(self):
 self.assertVariableWatched('_var_in_class_instance',
self._var_in_class_instance)
 
+  def test_fail_to_set_pipeline_result_key_not_pipeline(self):
+class NotPipeline(object):
+  pass
+
+with self.assertRaises(AssertionError) as ctx:
+  ie.current_env().set_pipeline_result(NotPipeline(),
+   runner.PipelineResult(
+   runner.PipelineState.RUNNING))
+  self.assertTrue('pipeline must be an instance of apache_beam.Pipeline '
+  'or its subclass' in ctx.exception)
+
+  def test_fail_to_set_pipeline_result_value_not_pipeline_result(self):
+class NotResult(object):
+  pass
+
+with self.assertRaises(AssertionError) as ctx:
+  ie.current_env().set_pipeline_result(self._p, NotResult())
+  self.assertTrue('result must be an instance of '
+  'apache_beam.runners.runner.PipelineResult or its '
+  'subclass' in ctx.exception)
+
+  def test_set_pipeline_result_successfully(self):
+class PipelineSubClass(beam.Pipeline):
+  pass
+
+class PipelineResultSubClass(runner.PipelineResult):
+  pass
+
+pipeline = PipelineSubClass()
+pipeline_result = PipelineResultSubClass(runner.PipelineState.RUNNING)
+ie.current_env().set_pipeline_result(pipeline, pipeline_result)
+self.assertIs(ie.current_env().pipeline_result(pipeline), pipeline_result)
+
+  def test_determine_terminal_state(self):
+for state in (runner.PipelineState.DONE,
+  runner.PipelineState.FAILED,
+  runner.PipelineState.CANCELLED,
+  runner.PipelineState.UPDATED,
+  runner.PipelineState.DRAINED):
+  ie.current_env().set_pipeline_result(self._p, runner.PipelineResult(
 
 Review comment:
   Great, thanks! I totally missed it. I'll change it to use that class method.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 332798)
Time Spent: 12h 20m  (was: 12h 10m)

> Visualize PCollection with Interactive Beam
> ---
>
> Key: BEAM-7926
> URL: https://issues.apache.org/jira/browse/BEAM-7926
> Project: Beam
>  Issue Type: New Feature
>  Components: runner-py-interactive
>Reporter: Ning Kang
>Assignee: Ning Kang
>Priority: Major
>  Time Spent: 12h 20m
>  Remaining Estimate: 0h
>
> Support auto plotting / charting of materialized data of a given PCollection 
> with Interactive Beam.
> Say an Interactive Beam pipeline is defined as
> p = create_pipeline()
> pcoll = p | 'Transform' >> transform()
> The user can call a single function and get auto-magical charting of the data 
> as materialized pcoll.
> e.g., visualize(pcoll)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
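The terminal-state check being tested in the diff above can be sketched with a plain enum. Beam's actual PipelineState is generated from protos and the review suggests reusing its existing class method, so treat the shape below as illustrative only:

```python
from enum import Enum


class PipelineState(Enum):
    RUNNING = 'RUNNING'
    DONE = 'DONE'
    FAILED = 'FAILED'
    CANCELLED = 'CANCELLED'
    UPDATED = 'UPDATED'
    DRAINED = 'DRAINED'

    @classmethod
    def is_terminal(cls, state):
        # A pipeline in any of these states will never produce more output,
        # so dynamic plotting can stop polling once one is reached.
        return state in (cls.DONE, cls.FAILED, cls.CANCELLED,
                         cls.UPDATED, cls.DRAINED)
```

With a single class method like this, the test no longer needs to enumerate the terminal states itself, which is exactly the simplification the reviewer proposed.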


[jira] [Work logged] (BEAM-7926) Visualize PCollection with Interactive Beam

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7926?focusedWorklogId=332797&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332797
 ]

ASF GitHub Bot logged work on BEAM-7926:


Author: ASF GitHub Bot
Created on: 23/Oct/19 18:47
Start Date: 23/Oct/19 18:47
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on pull request #9741: [BEAM-7926] 
Visualize PCollection
URL: https://github.com/apache/beam/pull/9741#discussion_r338218476
 
 

 ##
 File path: 
sdks/python/apache_beam/runners/interactive/interactive_environment.py
 ##
 @@ -61,8 +66,20 @@ def __init__(self, cache_manager=None):
 self._watching_set = set()
 # Holds variables list of (Dict[str, object]).
 self._watching_dict_list = []
+# Holds results of pipeline runs as Dict[Pipeline, PipelineResult].
+# Each key is a pipeline instance defined by the end user. The
+# InteractiveRunner is responsible for populating this dictionary
+# implicitly.
+self._pipeline_results = {}
 # Always watch __main__ module.
 self.watch('__main__')
+# Do a warning level logging if current python version is below 2 or 3.5.3.
+if sys.version_info < (3,):
+  logging.warning('Interactive Beam does not support Python 2.')
+elif sys.version_info < (3, 5, 3):
 
 Review comment:
   We can. I'll make it simpler. Let's just consider every different component 
a big feature. The feature requires Py 3.5.3. And we NOOP if dependency is not 
ready.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 332797)
Time Spent: 12h 10m  (was: 12h)

> Visualize PCollection with Interactive Beam
> ---
>
> Key: BEAM-7926
> URL: https://issues.apache.org/jira/browse/BEAM-7926
> Project: Beam
>  Issue Type: New Feature
>  Components: runner-py-interactive
>Reporter: Ning Kang
>Assignee: Ning Kang
>Priority: Major
>  Time Spent: 12h 10m
>  Remaining Estimate: 0h
>
> Support auto plotting / charting of materialized data of a given PCollection 
> with Interactive Beam.
> Say an Interactive Beam pipeline is defined as
> p = create_pipeline()
> pcoll = p | 'Transform' >> transform()
> The user can call a single function and get auto-magical charting of the data 
> as materialized pcoll.
> e.g., visualize(pcoll)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-8368) [Python] libprotobuf-generated exception when importing apache_beam

2019-10-23 Thread Brian Hulette (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-8368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16958159#comment-16958159
 ] 

Brian Hulette commented on BEAM-8368:
-

Beam 2.17 cut date is today, should we keep this as a blocker and potentially 
cherry-pick in the next couple of days if there's an arrow release to use? Or 
just punt until 2.18?

> [Python] libprotobuf-generated exception when importing apache_beam
> ---
>
> Key: BEAM-8368
> URL: https://issues.apache.org/jira/browse/BEAM-8368
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Affects Versions: 2.15.0, 2.17.0
>Reporter: Ubaier Bhat
>Assignee: Brian Hulette
>Priority: Blocker
> Fix For: 2.17.0
>
> Attachments: error_log.txt
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Unable to import apache_beam after upgrading to macos 10.15 (Catalina). 
> Cleared all the pipenvs and but can't get it working again.
> {code}
> import apache_beam as beam
> /Users/***/.local/share/virtualenvs/beam-etl-ims6DitU/lib/python3.7/site-packages/apache_beam/__init__.py:84:
>  UserWarning: Some syntactic constructs of Python 3 are not yet fully 
> supported by Apache Beam.
>   'Some syntactic constructs of Python 3 are not yet fully supported by '
> [libprotobuf ERROR google/protobuf/descriptor_database.cc:58] File already 
> exists in database: 
> [libprotobuf FATAL google/protobuf/descriptor.cc:1370] CHECK failed: 
> GeneratedDatabase()->Add(encoded_file_descriptor, size): 
> libc++abi.dylib: terminating with uncaught exception of type 
> google::protobuf::FatalException: CHECK failed: 
> GeneratedDatabase()->Add(encoded_file_descriptor, size): 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-4775) JobService should support returning metrics

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-4775?focusedWorklogId=332795&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332795
 ]

ASF GitHub Bot logged work on BEAM-4775:


Author: ASF GitHub Bot
Created on: 23/Oct/19 18:46
Start Date: 23/Oct/19 18:46
Worklog Time Spent: 10m 
  Work Description: pabloem commented on issue #9843: [BEAM-4775] 
Converting MonitoringInfos to MetricResults in PortableRunner
URL: https://github.com/apache/beam/pull/9843#issuecomment-545583207
 
 
   We'll have to rely on r: @Ardagan as Alex is not working on this anymore.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 332795)
Time Spent: 50h 20m  (was: 50h 10m)

> JobService should support returning metrics
> ---
>
> Key: BEAM-4775
> URL: https://issues.apache.org/jira/browse/BEAM-4775
> Project: Beam
>  Issue Type: Bug
>  Components: beam-model
>Reporter: Eugene Kirpichov
>Assignee: Kamil Wasilewski
>Priority: Major
>  Time Spent: 50h 20m
>  Remaining Estimate: 0h
>
> Design doc: [https://s.apache.org/get-metrics-api].
> Further discussion is ongoing on [this 
> doc|https://docs.google.com/document/d/1m83TsFvJbOlcLfXVXprQm1B7vUakhbLZMzuRrOHWnTg/edit?ts=5c826bb4#heading=h.faqan9rjc6dm].
> We want to report job metrics back to the portability harness from the runner 
> harness, for displaying to users.
> h1. Relevant PRs in flight:
> h2. Ready for Review:
>  * [#8022|https://github.com/apache/beam/pull/8022]: correct the Job RPC 
> protos from [#8018|https://github.com/apache/beam/pull/8018].
> h2. Iterating / Discussing:
>  * [#7971|https://github.com/apache/beam/pull/7971]: Flink portable metrics: 
> get ptransform from MonitoringInfo, not stage name
>  ** this is a simpler, Flink-specific PR that is basically duplicated inside 
> each of the following two, so may be worth trying to merge in first
>  * #[7915|https://github.com/apache/beam/pull/7915]: use MonitoringInfo data 
> model in Java SDK metrics
>  * [#7868|https://github.com/apache/beam/pull/7868]: MonitoringInfo URN tweaks
> h2. Merged
>  * [#8018|https://github.com/apache/beam/pull/8018]: add job metrics RPC 
> protos
>  * [#7867|https://github.com/apache/beam/pull/7867]: key MetricResult by a 
> MetricKey
>  * [#7938|https://github.com/apache/beam/pull/7938]: move MonitoringInfo 
> protos to model/pipeline module
>  * [#7883|https://github.com/apache/beam/pull/7883]: Add 
> MetricQueryResults.allMetrics() helper
>  * [#7866|https://github.com/apache/beam/pull/7866]: move function helpers 
> from fn-harness to sdks/java/core
>  * [#7890|https://github.com/apache/beam/pull/7890]: consolidate MetricResult 
> implementations
> h2. Closed
>  * [#7934|https://github.com/apache/beam/pull/7934]: job metrics RPC + SDK 
> support
>  * [#7876|https://github.com/apache/beam/pull/7876]: Clean up metric protos; 
> support integer distributions, gauges



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=332793&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332793
 ]

ASF GitHub Bot logged work on BEAM-1440:


Author: ASF GitHub Bot
Created on: 23/Oct/19 18:45
Start Date: 23/Oct/19 18:45
Worklog Time Spent: 10m 
  Work Description: pabloem commented on pull request #9772: [BEAM-1440] 
Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r337775537
 
 

 ##
 File path: sdks/python/apache_beam/io/gcp/bigquery_read_it_test.py
 ##
 @@ -86,23 +117,61 @@ def create_table(self, tablename):
 table_schema.fields.append(table_field)
 table = bigquery.Table(
 tableReference=bigquery.TableReference(
-projectId=self.project,
-datasetId=self.dataset_id,
-tableId=tablename),
+projectId=cls.project,
+datasetId=cls.dataset_id,
+tableId=table_name),
 schema=table_schema)
 request = bigquery.BigqueryTablesInsertRequest(
-projectId=self.project, datasetId=self.dataset_id, table=table)
-self.bigquery_client.client.tables.Insert(request)
+projectId=cls.project, datasetId=cls.dataset_id, table=table)
+cls.bigquery_client.client.tables.Insert(request)
 table_data = [
 {'number': 1, 'str': 'abc'},
 {'number': 2, 'str': 'def'},
 {'number': 3, 'str': u'你好'},
 {'number': 4, 'str': u'привет'}
 ]
-self.bigquery_client.insert_rows(
-self.project, self.dataset_id, tablename, table_data)
+cls.bigquery_client.insert_rows(
+cls.project, cls.dataset_id, table_name, table_data)
 
-  def create_table_new_types(self, table_name):
+  def get_expected_data(self):
 
 Review comment:
   Make this a constant, and use it to create the table, and to match in the 
asserts.
   ```
   TABLE_DATA = [
   {'number': 1, 'str': 'abc'},
   {'number': 2, 'str': 'def'},
   {'number': 3, 'str': u'你好'},
   {'number': 4, 'str': u'привет'}
   ]
   ```
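The reviewer's suggestion amounts to hoisting the rows into one module-level constant so that table setup and the assertions cannot drift apart. A minimal sketch of that pattern, with `populate` and `expected` as hypothetical stand-ins for the test helpers:

```python
# One source of truth for the test data, as suggested in the review.
# populate/expected are illustrative names, not the actual test methods.
TABLE_DATA = [
    {'number': 1, 'str': 'abc'},
    {'number': 2, 'str': 'def'},
    {'number': 3, 'str': u'你好'},
    {'number': 4, 'str': u'привет'},
]

def populate(insert_rows):
    insert_rows(TABLE_DATA)   # setup writes the constant to the table...

def expected():
    return TABLE_DATA         # ...and the asserts match against the same rows
```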
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 332793)

> Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
> --
>
> Key: BEAM-1440
> URL: https://issues.apache.org/jira/browse/BEAM-1440
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-py-core
>Reporter: Chamikara Madhusanka Jayalath
>Assignee: Kamil Wasilewski
>Priority: Major
>  Time Spent: 4h 40m
>  Remaining Estimate: 0h
>
> Currently we have a BigQuery native source for Python SDK [1].
> This can only be used by Dataflow runner.
> We should implement a Beam BigQuery source that implements 
> iobase.BoundedSource [2] interface so that other runners that try to use 
> Python SDK can read from BigQuery as well. Java SDK already has a Beam 
> BigQuery source [3].
> [1] 
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py
> [2] 
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/iobase.py#L70
> [3] 
> https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L1189



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=332789&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332789
 ]

ASF GitHub Bot logged work on BEAM-1440:


Author: ASF GitHub Bot
Created on: 23/Oct/19 18:45
Start Date: 23/Oct/19 18:45
Worklog Time Spent: 10m 
  Work Description: pabloem commented on pull request #9772: [BEAM-1440] 
Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r337778293
 
 

 ##
 File path: sdks/python/apache_beam/io/gcp/bigquery_read_it_test.py
 ##
 @@ -86,23 +117,61 @@ def create_table(self, tablename):
 table_schema.fields.append(table_field)
 table = bigquery.Table(
 tableReference=bigquery.TableReference(
-projectId=self.project,
-datasetId=self.dataset_id,
-tableId=tablename),
+projectId=cls.project,
+datasetId=cls.dataset_id,
+tableId=table_name),
 schema=table_schema)
 request = bigquery.BigqueryTablesInsertRequest(
-projectId=self.project, datasetId=self.dataset_id, table=table)
-self.bigquery_client.client.tables.Insert(request)
+projectId=cls.project, datasetId=cls.dataset_id, table=table)
+cls.bigquery_client.client.tables.Insert(request)
 table_data = [
 {'number': 1, 'str': 'abc'},
 {'number': 2, 'str': 'def'},
 {'number': 3, 'str': u'你好'},
 {'number': 4, 'str': u'привет'}
 ]
-self.bigquery_client.insert_rows(
-self.project, self.dataset_id, tablename, table_data)
+cls.bigquery_client.insert_rows(
+cls.project, cls.dataset_id, table_name, table_data)
 
-  def create_table_new_types(self, table_name):
+  def get_expected_data(self):
+return [
+{'number': 1, 'str': 'abc'},
+{'number': 2, 'str': 'def'},
+{'number': 3, 'str': u'你好'},
+{'number': 4, 'str': u'привет'}
+]
+
+  @skip(['PortableRunner', 'FlinkRunner'])
+  @attr('IT')
+  def test_native_source(self):
+with beam.Pipeline(argv=self.args) as p:
+  result = (p | 'read' >> beam.io.Read(beam.io.BigQuerySource(
+  query=self.query, use_standard_sql=True)))
+  assert_that(result, equal_to(self.get_expected_data()))
+
+  @skip(['DirectRunner', 'TestDirectRunner'])
+  @attr('IT')
+  def test_iobase_source(self):
+with beam.Pipeline(argv=self.args) as p:
+  result = (p | 'read' >> beam.io.ReadFromBigQuery(
+  query=self.query, use_standard_sql=True, project=self.project,
+  gcs_bucket_name='gs://temp-storage-for-end-to-end-tests'))
 
 Review comment:
   Why are we skipping this for DirectRunner? This should work there, right?
   `gcs_bucket_name` may need to be passed via the test pipeline arguments, in 
case the test runs in a project that does not have access to that bucket (we 
run it internally at Google).
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 332789)
Time Spent: 4.5h  (was: 4h 20m)

> Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
> --
>
> Key: BEAM-1440
> URL: https://issues.apache.org/jira/browse/BEAM-1440
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-py-core
>Reporter: Chamikara Madhusanka Jayalath
>Assignee: Kamil Wasilewski
>Priority: Major
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> Currently we have a BigQuery native source for Python SDK [1].
> This can only be used by Dataflow runner.
> We should implement a Beam BigQuery source that implements 
> iobase.BoundedSource [2] interface so that other runners that try to use 
> Python SDK can read from BigQuery as well. Java SDK already has a Beam 
> BigQuery source [3].
> [1] 
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py
> [2] 
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/iobase.py#L70
> [3] 
> https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L1189



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=332790&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332790
 ]

ASF GitHub Bot logged work on BEAM-1440:


Author: ASF GitHub Bot
Created on: 23/Oct/19 18:45
Start Date: 23/Oct/19 18:45
Worklog Time Spent: 10m 
  Work Description: pabloem commented on pull request #9772: [BEAM-1440] 
Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r337762067
 
 

 ##
 File path: sdks/python/apache_beam/io/gcp/bigquery_tools.py
 ##
 @@ -370,7 +383,37 @@ def _start_query_job(self, project_id, query, 
use_legacy_sql, flatten_results,
 jobReference=reference))
 
 response = self.client.jobs.Insert(request)
-return response.jobReference.jobId, response.jobReference.location
+return response
+
+  def wait_for_bq_job(self, job_reference, sleep_duration_sec=5,
+  max_retries=60):
+"""Poll job until it is DONE.
+
+Args:
+  job_reference: bigquery.JobReference instance.
+  sleep_duration_sec: Specifies the delay in seconds between retries.
+  max_retries: The total number of times to retry. If equal to 0,
+the function waits forever.
+
+Raises:
+  `RuntimeError`: If the job is FAILED or the number of retries has been
+reached.
+"""
+retry = 0
+while True:
+  retry += 1
+  job = self.get_job(job_reference.projectId, job_reference.jobId,
+ job_reference.location)
+  logging.info('Job status: %s', job.status.state)
+  if job.status.state == 'DONE' and job.status.errorResult:
+raise RuntimeError("BigQuery job %s failed. Error Result: %s",
 
 Review comment:
   Python errors don't replace strings that way. You'll have to do
   ```
   raise RuntimeError("BigQuery job %s failed. Error Result: %s" 
   % (job_reference.jobId, job.status.errorResult))
   ```
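The distinction pabloem points out is easy to verify: exception constructors, unlike `logging` calls, do not apply %-interpolation to their arguments and simply store the argument tuple. A minimal standalone demonstration (the job ID and error text are illustrative):

```python
# logging.error("job %s failed", job_id) interpolates lazily;
# RuntimeError("job %s failed", job_id) does not - it keeps the raw tuple.
broken = RuntimeError("BigQuery job %s failed. Error Result: %s",
                      "job-123", "quota exceeded")
fixed = RuntimeError("BigQuery job %s failed. Error Result: %s"
                     % ("job-123", "quota exceeded"))

print(str(broken))  # the %s placeholders survive, unformatted
print(str(fixed))   # a properly formatted message
```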
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 332790)
Time Spent: 4.5h  (was: 4h 20m)

> Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
> --
>
> Key: BEAM-1440
> URL: https://issues.apache.org/jira/browse/BEAM-1440
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-py-core
>Reporter: Chamikara Madhusanka Jayalath
>Assignee: Kamil Wasilewski
>Priority: Major
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> Currently we have a BigQuery native source for Python SDK [1].
> This can only be used by Dataflow runner.
> We should implement a Beam BigQuery source that implements 
> iobase.BoundedSource [2] interface so that other runners that try to use 
> Python SDK can read from BigQuery as well. Java SDK already has a Beam 
> BigQuery source [3].
> [1] 
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py
> [2] 
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/iobase.py#L70
> [3] 
> https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L1189



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
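For reference, the `wait_for_bq_job` polling pattern under review in this thread can be sketched standalone. `get_state` below is a hypothetical callable standing in for the BigQuery job-status lookup; the names are illustrative, not the Beam API:

```python
import time

def wait_until_done(get_state, sleep_duration_sec=5, max_retries=60):
    """Poll get_state() until it reports 'DONE'.

    Mirrors the diff's contract: max_retries == 0 means wait forever,
    otherwise raise once the retry budget is exhausted.
    """
    retries = 0
    while True:
        state = get_state()
        if state == 'DONE':
            return
        retries += 1
        if max_retries and retries >= max_retries:
            raise RuntimeError('Job did not finish after %d retries' % retries)
        time.sleep(sleep_duration_sec)
```

In the real code the terminal check also inspects `job.status.errorResult`, since a job can reach DONE in a failed state.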


[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=332792&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332792
 ]

ASF GitHub Bot logged work on BEAM-1440:


Author: ASF GitHub Bot
Created on: 23/Oct/19 18:45
Start Date: 23/Oct/19 18:45
Worklog Time Spent: 10m 
  Work Description: pabloem commented on pull request #9772: [BEAM-1440] 
Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r337779603
 
 

 ##
 File path: sdks/python/apache_beam/io/gcp/bigquery.py
 ##
 @@ -496,6 +505,233 @@ def reader(self, test_bigquery_client=None):
 kms_key=self.kms_key)
 
 
+SchemaFields = collections.namedtuple('SchemaFields', 'fields mode name type')
+
+
+def _to_bool(value):
+  return value == 'true'
+
+
+def _to_decimal(value):
+  return decimal.Decimal(value)
+
+
+def _to_bytes(value):
+  """Converts value from str to bytes on Python 3.x. Does nothing on
+  Python 2.7."""
+  return value.encode('utf-8')
+
+
+class _BigQueryRowCoder(coders.Coder):
+  """A coder for a table row (represented as a dict) from a JSON string which
+  applies additional conversions.
+  """
+
+  def __init__(self, table_schema):
+# bigquery.TableSchema type is unpicklable so we must translate it to a
+# picklable type
+self.fields = [SchemaFields(x.fields, x.mode, x.name, x.type)
+   for x in table_schema.fields]
+self._converters = {
+'INTEGER': int,
+'INT64': int,
+'FLOAT': float,
+'BOOLEAN': _to_bool,
+'NUMERIC': _to_decimal,
+'BYTES': _to_bytes,
+}
+
+  def decode(self, value):
+value = json.loads(value)
+for field in self.fields:
+  if field.name not in value:
+# The field exists in the schema, but it doesn't exist in this row.
+# It probably means its value was null, as the extract to JSON job
+# doesn't preserve null fields
+value[field.name] = None
+continue
+
+  try:
+converter = self._converters[field.type]
+value[field.name] = converter(value[field.name])
+  except KeyError:
+# No need to do any conversion
+pass
+return value
+
+  def is_deterministic(self):
+return True
+
+  def to_type_hint(self):
+return dict
+
+
+class _BigQuerySource(BoundedSource):
+  """Read data from BigQuery.
+
+This source uses a BigQuery export job to take a snapshot of the table
+on GCS, and then reads from each produced JSON file.
+
+Do note that currently this source does not work with DirectRunner.
+
+  Args:
+table (str, callable, ValueProvider): The ID of the table, or a callable
+  that returns it. The ID must contain only letters ``a-z``, ``A-Z``,
+  numbers ``0-9``, or underscores ``_``. If dataset argument is
+  :data:`None` then the table argument must contain the entire table
+  reference specified as: ``'DATASET.TABLE'``
+  or ``'PROJECT:DATASET.TABLE'``. If it's a callable, it must receive one
+  argument representing an element to be written to BigQuery, and return
+  a TableReference, or a string table name as specified above.
+dataset (str): The ID of the dataset containing this table or
+  :data:`None` if the table reference is specified entirely by the table
+  argument.
+project (str): The ID of the project containing this table.
+query (str): A query to be used instead of arguments table, dataset, and
+  project.
+  validate (bool): If :data:`True`, various checks will be done when source
+  gets initialized (e.g., is table present?). This should be
+  :data:`True` for most scenarios in order to catch errors as early as
+  possible (pipeline construction instead of pipeline execution). It
+  should be :data:`False` if the table is created during pipeline
+  execution by a previous step.
+coder (~apache_beam.coders.coders.Coder): The coder for the table
+  rows. If :data:`None`, then the default coder is
+  :class:`~apache_beam.io.gcp.bigquery._BigQueryRowCoder`,
+  which will interpret every line in a file as a JSON serialized
+  dictionary. This argument needs a value only in special cases when
+  returning table rows as dictionaries is not desirable.
+use_standard_sql (bool): Specifies whether to use BigQuery's standard SQL
+  dialect for this query. The default value is :data:`False`.
+  If set to :data:`True`, the query will use BigQuery's updated SQL
+  dialect with improved standards compliance.
+  This parameter is ignored for table inputs.
+flatten_results (bool): Flattens all nested and repeated fields in the
+  query results. The default value is :data:`True`.
+kms_key (str): Experimental. Optional Cloud KMS key name for use when
+  creating new tables.
+
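The conversion step that `_BigQueryRowCoder.decode` performs in the diff above can be sketched in isolation: BigQuery's JSON export emits scalars as strings and omits NULL fields, so each field is converted per its schema type and missing fields are restored as `None`. `decode_row` and `CONVERTERS` are illustrative names covering a subset of the types, not the Beam API:

```python
import decimal
import json

# Per-type converters, mirroring the coder's _converters table.
CONVERTERS = {
    'INTEGER': int,
    'INT64': int,
    'FLOAT': float,
    'BOOLEAN': lambda v: v == 'true',
    'NUMERIC': decimal.Decimal,
}

def decode_row(json_line, schema):
    """schema: iterable of (field_name, field_type) pairs."""
    row = json.loads(json_line)
    for name, ftype in schema:
        if name not in row:
            row[name] = None          # NULL fields are absent in the export
        elif ftype in CONVERTERS:
            row[name] = CONVERTERS[ftype](row[name])
    return row
```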

[jira] [Work logged] (BEAM-4775) JobService should support returning metrics

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-4775?focusedWorklogId=332794&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332794
 ]

ASF GitHub Bot logged work on BEAM-4775:


Author: ASF GitHub Bot
Created on: 23/Oct/19 18:45
Start Date: 23/Oct/19 18:45
Worklog Time Spent: 10m 
  Work Description: pabloem commented on issue #9843: [BEAM-4775] 
Converting MonitoringInfos to MetricResults in PortableRunner
URL: https://github.com/apache/beam/pull/9843#issuecomment-545583207
 
 
   We'll have to rely on r:@Ardagan 
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 332794)
Time Spent: 50h 10m  (was: 50h)

> JobService should support returning metrics
> ---
>
> Key: BEAM-4775
> URL: https://issues.apache.org/jira/browse/BEAM-4775
> Project: Beam
>  Issue Type: Bug
>  Components: beam-model
>Reporter: Eugene Kirpichov
>Assignee: Kamil Wasilewski
>Priority: Major
>  Time Spent: 50h 10m
>  Remaining Estimate: 0h
>
> Design doc: [https://s.apache.org/get-metrics-api].
> Further discussion is ongoing on [this 
> doc|https://docs.google.com/document/d/1m83TsFvJbOlcLfXVXprQm1B7vUakhbLZMzuRrOHWnTg/edit?ts=5c826bb4#heading=h.faqan9rjc6dm].
> We want to report job metrics back to the portability harness from the runner 
> harness, for displaying to users.
> h1. Relevant PRs in flight:
> h2. Ready for Review:
>  * [#8022|https://github.com/apache/beam/pull/8022]: correct the Job RPC 
> protos from [#8018|https://github.com/apache/beam/pull/8018].
> h2. Iterating / Discussing:
>  * [#7971|https://github.com/apache/beam/pull/7971]: Flink portable metrics: 
> get ptransform from MonitoringInfo, not stage name
>  ** this is a simpler, Flink-specific PR that is basically duplicated inside 
> each of the following two, so may be worth trying to merge in first
>  * #[7915|https://github.com/apache/beam/pull/7915]: use MonitoringInfo data 
> model in Java SDK metrics
>  * [#7868|https://github.com/apache/beam/pull/7868]: MonitoringInfo URN tweaks
> h2. Merged
>  * [#8018|https://github.com/apache/beam/pull/8018]: add job metrics RPC 
> protos
>  * [#7867|https://github.com/apache/beam/pull/7867]: key MetricResult by a 
> MetricKey
>  * [#7938|https://github.com/apache/beam/pull/7938]: move MonitoringInfo 
> protos to model/pipeline module
>  * [#7883|https://github.com/apache/beam/pull/7883]: Add 
> MetricQueryResults.allMetrics() helper
>  * [#7866|https://github.com/apache/beam/pull/7866]: move function helpers 
> from fn-harness to sdks/java/core
>  * [#7890|https://github.com/apache/beam/pull/7890]: consolidate MetricResult 
> implementations
> h2. Closed
>  * [#7934|https://github.com/apache/beam/pull/7934]: job metrics RPC + SDK 
> support
>  * [#7876|https://github.com/apache/beam/pull/7876]: Clean up metric protos; 
> support integer distributions, gauges



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=332791&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332791
 ]

ASF GitHub Bot logged work on BEAM-1440:


Author: ASF GitHub Bot
Created on: 23/Oct/19 18:45
Start Date: 23/Oct/19 18:45
Worklog Time Spent: 10m 
  Work Description: pabloem commented on pull request #9772: [BEAM-1440] 
Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r337764174
 
 

 ##
 File path: sdks/python/apache_beam/io/gcp/bigquery_tools.py
 ##
 @@ -695,10 +769,12 @@ def get_or_create_table(
 
   def run_query(self, project_id, query, use_legacy_sql, flatten_results,
 dry_run=False):
-job_id, location = self._start_query_job(project_id, query,
- use_legacy_sql, flatten_results,
- job_id=uuid.uuid4().hex,
- dry_run=dry_run)
+job = self._start_query_job(project_id, query, use_legacy_sql,
 
 Review comment:
   Did this API change? Did it previously return a job_id and location, and 
does it now return the whole job object?
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 332791)
Time Spent: 4.5h  (was: 4h 20m)

> Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
> --
>
> Key: BEAM-1440
> URL: https://issues.apache.org/jira/browse/BEAM-1440
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-py-core
>Reporter: Chamikara Madhusanka Jayalath
>Assignee: Kamil Wasilewski
>Priority: Major
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> Currently we have a BigQuery native source for Python SDK [1].
> This can only be used by Dataflow runner.
> We should implement a Beam BigQuery source that implements 
> iobase.BoundedSource [2] interface so that other runners that try to use 
> Python SDK can read from BigQuery as well. Java SDK already has a Beam 
> BigQuery source [3].
> [1] 
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py
> [2] 
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/iobase.py#L70
> [3] 
> https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L1189



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=332788&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332788
 ]

ASF GitHub Bot logged work on BEAM-1440:


Author: ASF GitHub Bot
Created on: 23/Oct/19 18:45
Start Date: 23/Oct/19 18:45
Worklog Time Spent: 10m 
  Work Description: pabloem commented on pull request #9772: [BEAM-1440] 
Create a BigQuery source that implements iobase.BoundedSource for Python
URL: https://github.com/apache/beam/pull/9772#discussion_r337760626
 
 

 ##
 File path: sdks/python/test-suites/portable/py37/build.gradle
 ##
 @@ -30,3 +33,25 @@ task preCommitPy37() {
 dependsOn portableWordCountBatch
 dependsOn portableWordCountStreaming
 }
+
+task postCommitIT {
+  dependsOn 'installGcpTest'
+  dependsOn 'setupVirtualenv'
+  dependsOn ':runners:flink:1.8:job-server:shadowJar'
+
+  doLast {
+def tests = [
+"apache_beam.io.gcp.bigquery_read_it_test",
+]
+def testOpts = ["--tests=${tests.join(',')}"]
+def cmdArgs = mapToArgString([
+"test_opts": testOpts,
+"suite": "postCommitIT-flink-py37",
+"pipeline_opts": "--runner=FlinkRunner --project=apache-beam-testing 
--environment_type=LOOPBACK",
 
 Review comment:
   Have you run this test with Flink?
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 332788)
Time Spent: 4h 20m  (was: 4h 10m)

> Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
> --
>
> Key: BEAM-1440
> URL: https://issues.apache.org/jira/browse/BEAM-1440
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-py-core
>Reporter: Chamikara Madhusanka Jayalath
>Assignee: Kamil Wasilewski
>Priority: Major
>  Time Spent: 4h 20m
>  Remaining Estimate: 0h
>
> Currently we have a BigQuery native source for Python SDK [1].
> This can only be used by Dataflow runner.
> We should implement a Beam BigQuery source that implements 
> iobase.BoundedSource [2] interface so that other runners that try to use 
> Python SDK can read from BigQuery as well. Java SDK already has a Beam 
> BigQuery source [3].
> [1] 
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py
> [2] 
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/iobase.py#L70
> [3] 
> https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L1189



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-7926) Visualize PCollection with Interactive Beam

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7926?focusedWorklogId=332787&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332787
 ]

ASF GitHub Bot logged work on BEAM-7926:


Author: ASF GitHub Bot
Created on: 23/Oct/19 18:44
Start Date: 23/Oct/19 18:44
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on pull request #9741: [BEAM-7926] 
Visualize PCollection
URL: https://github.com/apache/beam/pull/9741#discussion_r338217168
 
 

 ##
 File path: 
sdks/python/apache_beam/runners/interactive/display/pcoll_visualization_test.py
 ##
 @@ -0,0 +1,152 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Tests for apache_beam.runners.interactive.display.pcoll_visualization."""
+from __future__ import absolute_import
+
+import sys
+import time
+import unittest
+
+import apache_beam as beam  # pylint: disable=ungrouped-imports
+import timeloop
+from apache_beam.runners import runner
+from apache_beam.runners.interactive import interactive_environment as ie
+from apache_beam.runners.interactive.display import pcoll_visualization as pv
+
+# Work around nose tests using Python2 without unittest.mock module.
+try:
+  from unittest.mock import patch
+except ImportError:
+  from mock import patch
+
+
+class PCollVisualizationTest(unittest.TestCase):
+
+  def setUp(self):
+self._p = beam.Pipeline()
+# pylint: disable=range-builtin-not-iterating
+self._pcoll = self._p | 'Create' >> beam.Create(range(1000))
+
+  @unittest.skipIf(sys.version_info < (3, 5, 3),
+   'PCollVisualization is not supported on Python 2.')
+  def test_raise_error_for_non_pcoll_input(self):
+class Foo(object):
+  pass
+
+with self.assertRaises(AssertionError) as ctx:
+  pv.PCollVisualization(Foo())
+  self.assertTrue('pcoll should be apache_beam.pvalue.PCollection' in
+  ctx.exception)
+
+  @unittest.skipIf(sys.version_info < (3, 5, 3),
+   'PCollVisualization is not supported on Python 2.')
+  def test_pcoll_visualization_generate_unique_display_id(self):
+pv_1 = pv.PCollVisualization(self._pcoll)
+pv_2 = pv.PCollVisualization(self._pcoll)
+self.assertNotEqual(pv_1._dive_display_id, pv_2._dive_display_id)
+self.assertNotEqual(pv_1._overview_display_id, pv_2._overview_display_id)
+self.assertNotEqual(pv_1._df_display_id, pv_2._df_display_id)
+
+  @unittest.skipIf(sys.version_info < (3, 5, 3),
+   'PCollVisualization is not supported on Python 2.')
+  @patch('apache_beam.runners.interactive.display.pcoll_visualization'
+ '.PCollVisualization._to_element_list', lambda x: [1, 2, 3])
+  def test_one_shot_visualization_not_return_handle(self):
+self.assertIsNone(pv.visualize(self._pcoll))
+
+  def _mock_to_element_list(self):
+yield [1, 2, 3]
+yield [1, 2, 3, 4]
+yield [1, 2, 3, 4, 5]
+yield [1, 2, 3, 4, 5, 6]
+yield [1, 2, 3, 4, 5, 6, 7]
+yield [1, 2, 3, 4, 5, 6, 7, 8]
+
+  @unittest.skipIf(sys.version_info < (3, 5, 3),
+   'PCollVisualization is not supported on Python 2.')
+  @patch('apache_beam.runners.interactive.display.pcoll_visualization'
+ '.PCollVisualization._to_element_list', _mock_to_element_list)
+  def test_dynamic_plotting_return_handle(self):
+h = pv.visualize(self._pcoll, dynamic_plotting_interval=1)
+self.assertIsInstance(h, timeloop.Timeloop)
+h.stop()
+
+  @unittest.skipIf(sys.version_info < (3, 5, 3),
+   'PCollVisualization is not supported on Python 2.')
+  @patch('apache_beam.runners.interactive.display.pcoll_visualization'
+ '.PCollVisualization._to_element_list', _mock_to_element_list)
+  @patch('apache_beam.runners.interactive.display.pcoll_visualization'
+ '.PCollVisualization.display_facets')
+  def test_dynamic_plotting_update_same_display(self,
+mocked_display_facets):
+fake_pipeline_result = runner.PipelineResult(runner.PipelineState.RUNNING)
+ie.current_env().set_pipeline_result(self._p, fake_pipeline_result)
+# Starts async dynamic plotting that never 

[jira] [Work logged] (BEAM-7926) Visualize PCollection with Interactive Beam

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7926?focusedWorklogId=332777&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332777
 ]

ASF GitHub Bot logged work on BEAM-7926:


Author: ASF GitHub Bot
Created on: 23/Oct/19 18:37
Start Date: 23/Oct/19 18:37
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on pull request #9741: [BEAM-7926] 
Visualize PCollection
URL: https://github.com/apache/beam/pull/9741#discussion_r338213864
 
 

 ##
 File path: 
sdks/python/apache_beam/runners/interactive/display/pcoll_visualization.py
 ##
 @@ -0,0 +1,279 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Module visualizes PCollection data.
+
+For internal use only; no backwards-compatibility guarantees.
+Only works with Python 3.5+.
+"""
+from __future__ import absolute_import
+
+import base64
+import logging
+from datetime import timedelta
+
+from pandas.io.json import json_normalize
+
+from apache_beam import pvalue
+from apache_beam.runners.interactive import interactive_environment as ie
+from apache_beam.runners.interactive import pipeline_instrument as instr
+
+# jsons doesn't support < Python 3.5. Work around with json for legacy tests.
+# TODO(BEAM-8288): clean up once Py2 is deprecated from Beam.
+try:
+  import jsons
+  _pv_jsons_load = jsons.load
+  _pv_jsons_dump = jsons.dump
+except ImportError:
+  import json
+  _pv_jsons_load = json.load
+  _pv_jsons_dump = json.dump
+
+try:
+  from facets_overview.generic_feature_statistics_generator import 
GenericFeatureStatisticsGenerator
+  _facets_gfsg_ready = True
+except ImportError:
+  _facets_gfsg_ready = False
+
+try:
+  from IPython.core.display import HTML
+  from IPython.core.display import Javascript
+  from IPython.core.display import display
+  from IPython.core.display import display_javascript
+  from IPython.core.display import update_display
+  _ipython_ready = True
+except ImportError:
+  _ipython_ready = False
+
+try:
+  from timeloop import Timeloop
+  _tl_ready = True
+except ImportError:
+  _tl_ready = False
+
+# 1-d types that need additional normalization to be compatible with DataFrame.
+_one_dimension_types = (int, float, str, bool, list, tuple)
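The normalization this constant supports can be sketched without pandas. The helper below is illustrative only (it is not part of the quoted module) and shows why bare 1-d values need wrapping before they can become DataFrame rows:

```python
# Illustrative helper (not the quoted module's code): wrap 1-d values in
# single-key dicts so tabular code can treat every element as a row.
_one_dimension_types = (int, float, str, bool, list, tuple)

def normalize_elements(elements):
    # Dict-like elements already map column -> value; bare 1-d values get a
    # synthetic 'value' column.
    return [{'value': el} if isinstance(el, _one_dimension_types) else dict(el)
            for el in elements]
```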
+
+_DIVE_SCRIPT_TEMPLATE = """
+document.querySelector("#{display_id}").data = {jsonstr};"""
+_DIVE_HTML_TEMPLATE = """
+<script src="https://cdnjs.cloudflare.com/ajax/libs/webcomponentsjs/1.3.3/webcomponents-lite.js"></script>
+<link rel="import" href="https://raw.githubusercontent.com/PAIR-code/facets/1.0.0/facets-dist/facets-jupyter.html">
+<facets-dive id="{display_id}" height="600"></facets-dive>
+<script>
+  document.querySelector("#{display_id}").data = {jsonstr};
+</script>"""
+_OVERVIEW_SCRIPT_TEMPLATE = """
+  document.querySelector("#{display_id}").protoInput = "{protostr}";
+  """
+_OVERVIEW_HTML_TEMPLATE = """
+<script src="https://cdnjs.cloudflare.com/ajax/libs/webcomponentsjs/1.3.3/webcomponents-lite.js"></script>
+<link rel="import" href="https://raw.githubusercontent.com/PAIR-code/facets/1.0.0/facets-dist/facets-jupyter.html">
+<facets-overview id="{display_id}"></facets-overview>
+<script>
+  document.querySelector("#{display_id}").protoInput = "{protostr}";
+</script>"""
+_DATAFRAME_PAGINATION_TEMPLATE = """
+<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.2.2/jquery.min.js"></script>
+<script src="https://cdn.datatables.net/1.10.16/js/jquery.dataTables.js"></script>
+<link rel="stylesheet" href="https://cdn.datatables.net/1.10.16/css/jquery.dataTables.css">
+{dataframe_html}
+<script>
+  $("#{table_id}").DataTable();
+</script>"""
+
+
+def visualize(pcoll, dynamic_plotting_interval=None):
+  """Visualizes the data of a given PCollection. Optionally enables dynamic
+  plotting with an interval in seconds if the PCollection is being produced by
+  a running pipeline or the pipeline is streaming indefinitely. The function
+  always returns immediately and is asynchronous when dynamic plotting is on.
+
+  If dynamic plotting is enabled, the visualization is updated continuously until
+  the pipeline producing the PCollection is in an end state. The visualization
+  would be anchored to the notebook cell output 

[jira] [Work logged] (BEAM-7926) Visualize PCollection with Interactive Beam

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7926?focusedWorklogId=332775=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332775
 ]

ASF GitHub Bot logged work on BEAM-7926:


Author: ASF GitHub Bot
Created on: 23/Oct/19 18:32
Start Date: 23/Oct/19 18:32
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on pull request #9741: [BEAM-7926] 
Visualize PCollection
URL: https://github.com/apache/beam/pull/9741#discussion_r338210890
 
 

 ##
 File path: 
sdks/python/apache_beam/runners/interactive/display/pcoll_visualization.py
 ##
 @@ -0,0 +1,279 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Module visualizes PCollection data.
+
+For internal use only; no backwards-compatibility guarantees.
+Only works with Python 3.5+.
+"""
+from __future__ import absolute_import
+
+import base64
+import logging
+from datetime import timedelta
+
+from pandas.io.json import json_normalize
+
+from apache_beam import pvalue
+from apache_beam.runners.interactive import interactive_environment as ie
+from apache_beam.runners.interactive import pipeline_instrument as instr
+
+# jsons doesn't support Python < 3.5. Work around with json for legacy tests.
+# TODO(BEAM-8288): clean up once Py2 is deprecated from Beam.
+try:
+  import jsons
+  _pv_jsons_load = jsons.load
+  _pv_jsons_dump = jsons.dump
+except ImportError:
+  import json
+  _pv_jsons_load = json.load
+  _pv_jsons_dump = json.dump
+
+try:
+  from facets_overview.generic_feature_statistics_generator import GenericFeatureStatisticsGenerator
+  _facets_gfsg_ready = True
+except ImportError:
+  _facets_gfsg_ready = False
+
+try:
+  from IPython.core.display import HTML
+  from IPython.core.display import Javascript
+  from IPython.core.display import display
+  from IPython.core.display import display_javascript
+  from IPython.core.display import update_display
+  _ipython_ready = True
+except ImportError:
+  _ipython_ready = False
+
+try:
+  from timeloop import Timeloop
+  _tl_ready = True
+except ImportError:
+  _tl_ready = False
+
+# 1-d types that need additional normalization to be compatible with DataFrame.
+_one_dimension_types = (int, float, str, bool, list, tuple)
+
+_DIVE_SCRIPT_TEMPLATE = """
+document.querySelector("#{display_id}").data = {jsonstr};"""
+_DIVE_HTML_TEMPLATE = """
+<script src="https://cdnjs.cloudflare.com/ajax/libs/webcomponentsjs/1.3.3/webcomponents-lite.js"></script>
 
 Review comment:
   The `facets` templates originate from examples in notebooks shared among the 
OSS community.
   The `jquery` template simply uses a jquery DataTable widget.
   As for unit tests, the logic of each template is so simple that there is 
nothing really to test.
   The logic I added to set display_id and update existing widgets has been 
unit tested.
   We'll have integration tests later that execute a notebook and verify its 
output area. Such an integration test framework is being developed by another 
team; we'll merge our efforts there.
   As for manual testing, I've exercised the templates from a notebook. I can 
attach a png and a gif if needed.
   
   Off topic:
   In future PRs, we might have `%%javascript` magic involved in the Python 
code. We could then use a library such as `js2py` to unit test the JavaScript 
from Python.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 332775)
Time Spent: 11h 40m  (was: 11.5h)

> Visualize PCollection with Interactive Beam
> ---
>
> Key: BEAM-7926
> URL: https://issues.apache.org/jira/browse/BEAM-7926
> Project: Beam
>  Issue Type: New Feature
>  Components: runner-py-interactive
>Reporter: Ning Kang
>Assignee: Ning Kang
>Priority: Major
>  Time 

[jira] [Work logged] (BEAM-7926) Visualize PCollection with Interactive Beam

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7926?focusedWorklogId=332774=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332774
 ]

ASF GitHub Bot logged work on BEAM-7926:


Author: ASF GitHub Bot
Created on: 23/Oct/19 18:32
Start Date: 23/Oct/19 18:32
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on pull request #9741: [BEAM-7926] 
Visualize PCollection
URL: https://github.com/apache/beam/pull/9741#discussion_r338210890
 
 

 ##
 File path: 
sdks/python/apache_beam/runners/interactive/display/pcoll_visualization.py
 ##
 @@ -0,0 +1,279 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Module visualizes PCollection data.
+
+For internal use only; no backwards-compatibility guarantees.
+Only works with Python 3.5+.
+"""
+from __future__ import absolute_import
+
+import base64
+import logging
+from datetime import timedelta
+
+from pandas.io.json import json_normalize
+
+from apache_beam import pvalue
+from apache_beam.runners.interactive import interactive_environment as ie
+from apache_beam.runners.interactive import pipeline_instrument as instr
+
+# jsons doesn't support Python < 3.5. Work around with json for legacy tests.
+# TODO(BEAM-8288): clean up once Py2 is deprecated from Beam.
+try:
+  import jsons
+  _pv_jsons_load = jsons.load
+  _pv_jsons_dump = jsons.dump
+except ImportError:
+  import json
+  _pv_jsons_load = json.load
+  _pv_jsons_dump = json.dump
+
+try:
+  from facets_overview.generic_feature_statistics_generator import GenericFeatureStatisticsGenerator
+  _facets_gfsg_ready = True
+except ImportError:
+  _facets_gfsg_ready = False
+
+try:
+  from IPython.core.display import HTML
+  from IPython.core.display import Javascript
+  from IPython.core.display import display
+  from IPython.core.display import display_javascript
+  from IPython.core.display import update_display
+  _ipython_ready = True
+except ImportError:
+  _ipython_ready = False
+
+try:
+  from timeloop import Timeloop
+  _tl_ready = True
+except ImportError:
+  _tl_ready = False
+
+# 1-d types that need additional normalization to be compatible with DataFrame.
+_one_dimension_types = (int, float, str, bool, list, tuple)
+
+_DIVE_SCRIPT_TEMPLATE = """
+document.querySelector("#{display_id}").data = {jsonstr};"""
+_DIVE_HTML_TEMPLATE = """
+<script src="https://cdnjs.cloudflare.com/ajax/libs/webcomponentsjs/1.3.3/webcomponents-lite.js"></script>
 
 Review comment:
   The `facets` templates originate from examples in notebooks shared among the 
OSS community.
   The `jquery` template simply uses a jquery DataTable widget.
   As for unit tests, the logic of each template is so simple that there is 
nothing really to test.
   The logic I've added to set display_id and update existing widgets has been 
unit tested.
   We'll have integration tests later that execute a notebook and verify its 
output area. Such an integration test framework is being developed by another 
team; we'll merge our efforts there.
   As for manual testing, I've exercised the templates from a notebook. I can 
attach a png and a gif if needed.
   
   Off topic:
   In future PRs, we might have `%%javascript` magic involved in the Python 
code. We could then use a library such as `js2py` to unit test the JavaScript 
from Python.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 332774)
Time Spent: 11.5h  (was: 11h 20m)

> Visualize PCollection with Interactive Beam
> ---
>
> Key: BEAM-7926
> URL: https://issues.apache.org/jira/browse/BEAM-7926
> Project: Beam
>  Issue Type: New Feature
>  Components: runner-py-interactive
>Reporter: Ning Kang
>Assignee: Ning Kang
>Priority: Major
>  Time 

[jira] [Comment Edited] (BEAM-8397) DataflowRunnerTest.test_remote_runner_display_data fails due to infinite recursion during pickling.

2019-10-23 Thread Valentyn Tymofieiev (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-8397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958087#comment-16958087
 ] 

Valentyn Tymofieiev edited comment on BEAM-8397 at 10/23/19 6:31 PM:
-

It appears that test_remote_runner_display_data triggers infinite recursion in 
_find_containing_class / _find_containing_class_inner methods.

This recursion is present on Python 2 and all supported Python 3 versions; 
however, Python does not reliably detect it, so it either crashes the process 
or goes unnoticed depending on the Python version and sys.getrecursionlimit(). 
To observe the recursion problem, we can add some logging to 
_find_containing_class_inner. With the changes in [1] we get:
{noformat}
python ./setup.py nosetests --nocapture --tests 'apache_beam/runners/dataflow/dataflow_runner_test.py:DataflowRunnerTest.test_remote_runner_display_data'

======================================================================
FAIL: test_remote_runner_display_data (apache_beam.runners.dataflow.dataflow_runner_test.DataflowRunnerTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File 
"/usr/local/google/home/valentyn/projects/beam/beam3/beam/sdks/python/apache_beam/runners/dataflow/dataflow_runner_test.py",
 line 283, in test_remote_runner_display_data
self.assertEqual(len(disp_data), 4)
AssertionError: 3 != 4
 (This is intentional; otherwise the test passes)
-------------------- >> begin captured logging << --------------------
...
root: WARNING: _find_containing_class_inner called 5312 times
root: WARNING: _find_containing_class_inner called 5313 times
root: WARNING: _find_containing_class_inner called 5314 times
 (This correlates with the increased recursion limit set in [1])
--------------------- >> end captured logging << ---------------------
...
{noformat}
I reproduced this further in the following counterexample:
{noformat}
from apache_beam.internal import pickler


class Outter():

  def method(self):

    class InnerA(object):

      def __init__(self):
        pass

    class InnerB(InnerA):

      def __init__(self):
        super(InnerB, self).__init__()

    o = InnerB()
    pickler.loads(pickler.dumps(o))


c = Outter()
c.method()
{noformat}
Running the above via python -m test fails (on Py2 and Py3.x) with:
{noformat}
RuntimeError: maximum recursion depth exceeded while getting the str of an 
object
{noformat}
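The dependence on sys.getrecursionlimit() described above can be demonstrated in isolation with a stdlib-only sketch (no Beam involved; `probe_recursion` is an illustrative name, not anything from the codebase):

```python
import sys

def probe_recursion(limit):
    """Recurse without bound under the given interpreter limit and report
    the depth reached before CPython raises RecursionError."""
    depth = [0]

    def recurse():
        depth[0] += 1
        recurse()

    old = sys.getrecursionlimit()
    sys.setrecursionlimit(limit)
    try:
        recurse()
    except RecursionError:  # "maximum recursion depth exceeded"
        return depth[0]
    finally:
        sys.setrecursionlimit(old)
```

A low limit surfaces runaway recursion quickly; a generous limit can let the same bug run thousands of frames deep before anything is reported, which matches the flaky behavior seen in the test.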
I think the long-term course of action here should be removing patches[2] of 
dill (contributing the patches upstream, or switching to a different pickler: 
BEAM-8123).

In the short term, to unblock BEAM-5878, we can remove the nested inner 
classes from test_remote_runner_display_data and proceed with the planned dill 
upgrade to 0.3.1.1. As noted above, the upgrade itself does not trigger the 
test failure: the infinite recursion is already happening, but in most cases 
the test does not fail. It would also be nice to file an issue against CPython 
to improve detection of infinite recursion, with a repro they can follow.

cc: [~robertwb] [~altay] [~udim] FYI.

 

[1] [https://github.com/apache/beam/pull/9859]

[2] 
https://github.com/apache/beam/blob/eb40876109a557586c33b77923025508712184ab/sdks/python/apache_beam/internal/pickler.py#L126



[jira] [Work logged] (BEAM-7926) Visualize PCollection with Interactive Beam

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7926?focusedWorklogId=332772=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332772
 ]

ASF GitHub Bot logged work on BEAM-7926:


Author: ASF GitHub Bot
Created on: 23/Oct/19 18:31
Start Date: 23/Oct/19 18:31
Worklog Time Spent: 10m 
  Work Description: KevinGG commented on pull request #9741: [BEAM-7926] 
Visualize PCollection
URL: https://github.com/apache/beam/pull/9741#discussion_r338210890
 
 

 ##
 File path: 
sdks/python/apache_beam/runners/interactive/display/pcoll_visualization.py
 ##
 @@ -0,0 +1,279 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Module visualizes PCollection data.
+
+For internal use only; no backwards-compatibility guarantees.
+Only works with Python 3.5+.
+"""
+from __future__ import absolute_import
+
+import base64
+import logging
+from datetime import timedelta
+
+from pandas.io.json import json_normalize
+
+from apache_beam import pvalue
+from apache_beam.runners.interactive import interactive_environment as ie
+from apache_beam.runners.interactive import pipeline_instrument as instr
+
+# jsons doesn't support Python < 3.5. Work around with json for legacy tests.
+# TODO(BEAM-8288): clean up once Py2 is deprecated from Beam.
+try:
+  import jsons
+  _pv_jsons_load = jsons.load
+  _pv_jsons_dump = jsons.dump
+except ImportError:
+  import json
+  _pv_jsons_load = json.load
+  _pv_jsons_dump = json.dump
+
+try:
+  from facets_overview.generic_feature_statistics_generator import GenericFeatureStatisticsGenerator
+  _facets_gfsg_ready = True
+except ImportError:
+  _facets_gfsg_ready = False
+
+try:
+  from IPython.core.display import HTML
+  from IPython.core.display import Javascript
+  from IPython.core.display import display
+  from IPython.core.display import display_javascript
+  from IPython.core.display import update_display
+  _ipython_ready = True
+except ImportError:
+  _ipython_ready = False
+
+try:
+  from timeloop import Timeloop
+  _tl_ready = True
+except ImportError:
+  _tl_ready = False
+
+# 1-d types that need additional normalization to be compatible with DataFrame.
+_one_dimension_types = (int, float, str, bool, list, tuple)
+
+_DIVE_SCRIPT_TEMPLATE = """
+document.querySelector("#{display_id}").data = {jsonstr};"""
+_DIVE_HTML_TEMPLATE = """
+<script src="https://cdnjs.cloudflare.com/ajax/libs/webcomponentsjs/1.3.3/webcomponents-lite.js"></script>
 
 Review comment:
   The `facets` templates originate from examples in notebooks shared among the 
OSS community.
   The `jquery` template simply uses a jquery DataTable widget.
   As for unit tests, the logic of each template is so simple that there is 
nothing really to test.
   We'll have integration tests later that execute a notebook and verify its 
output area. Such an integration test framework is being developed by another 
team; we'll merge our efforts there.
   As for manual testing, I've exercised the templates from a notebook. I can 
attach a png and a gif if needed.
   
   Off topic:
   In future PRs, we might have `%%javascript` magic involved in the Python 
code. We could then use a library such as `js2py` to unit test the JavaScript 
from Python.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 332772)
Time Spent: 11h 20m  (was: 11h 10m)

> Visualize PCollection with Interactive Beam
> ---
>
> Key: BEAM-7926
> URL: https://issues.apache.org/jira/browse/BEAM-7926
> Project: Beam
>  Issue Type: New Feature
>  Components: runner-py-interactive
>Reporter: Ning Kang
>Assignee: Ning Kang
>Priority: Major
>  Time Spent: 11h 20m
>  Remaining Estimate: 0h
>
> Support auto plotting / charting of 

[jira] [Work logged] (BEAM-8467) Enable reading compressed files with Python fileio

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8467?focusedWorklogId=332771=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332771
 ]

ASF GitHub Bot logged work on BEAM-8467:


Author: ASF GitHub Bot
Created on: 23/Oct/19 18:31
Start Date: 23/Oct/19 18:31
Worklog Time Spent: 10m 
  Work Description: pabloem commented on pull request #9861: [BEAM-8467] 
Enabling reading compressed files
URL: https://github.com/apache/beam/pull/9861
 
 
   r: @chamikaramj 
   
   
   Thank you for your contribution! Follow this checklist to help us 
incorporate your contribution quickly and easily:
   
- [ ] [**Choose 
reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and 
mention them in a comment (`R: @username`).
- [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in 
ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA 
issue, if applicable. This will automatically link the pull request to the 
issue.
- [ ] If this contribution is large, please file an Apache [Individual 
Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
   
   See the [Contributor Guide](https://beam.apache.org/contribute) for more 
tips on [how to make review process 
smoother](https://beam.apache.org/contribute/#make-reviewers-job-easier).
   

[jira] [Created] (BEAM-8467) Enable reading compressed files with Python fileio

2019-10-23 Thread Pablo Estrada (Jira)
Pablo Estrada created BEAM-8467:
---

 Summary: Enable reading compressed files with Python fileio
 Key: BEAM-8467
 URL: https://issues.apache.org/jira/browse/BEAM-8467
 Project: Beam
  Issue Type: Improvement
  Components: io-py-files
Reporter: Pablo Estrada
Assignee: Pablo Estrada






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-8467) Enable reading compressed files with Python fileio

2019-10-23 Thread Pablo Estrada (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pablo Estrada updated BEAM-8467:

Status: Open  (was: Triage Needed)

> Enable reading compressed files with Python fileio
> --
>
> Key: BEAM-8467
> URL: https://issues.apache.org/jira/browse/BEAM-8467
> Project: Beam
>  Issue Type: Improvement
>  Components: io-py-files
>Reporter: Pablo Estrada
>Assignee: Pablo Estrada
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-3713) Consider moving away from nose to nose2 or pytest.

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-3713?focusedWorklogId=332769=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332769
 ]

ASF GitHub Bot logged work on BEAM-3713:


Author: ASF GitHub Bot
Created on: 23/Oct/19 18:29
Start Date: 23/Oct/19 18:29
Worklog Time Spent: 10m 
  Work Description: tvalentyn commented on pull request #9756: [BEAM-3713] 
Add pytest for unit tests
URL: https://github.com/apache/beam/pull/9756#discussion_r338208964
 
 

 ##
 File path: sdks/python/apache_beam/runners/dataflow/dataflow_runner_test.py
 ##
 @@ -228,6 +229,8 @@ def test_biqquery_read_streaming_fail(self):
  r'source is not currently available'):
   p.run()
 
+  @pytest.mark.skipif(sys.version_info >= (3, 7),
+  reason='TODO(BEAM-8095): Segfaults in Python 3.7')
 
 Review comment:
   Regarding problems with this test, see: 
https://issues.apache.org/jira/browse/BEAM-8397?focusedCommentId=16958087=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16958087
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 332769)
Time Spent: 10h 40m  (was: 10.5h)

> Consider moving away from nose to nose2 or pytest.
> --
>
> Key: BEAM-3713
> URL: https://issues.apache.org/jira/browse/BEAM-3713
> Project: Beam
>  Issue Type: Test
>  Components: sdk-py-core, testing
>Reporter: Robert Bradshaw
>Assignee: Udi Meiri
>Priority: Minor
>  Time Spent: 10h 40m
>  Remaining Estimate: 0h
>
> Per https://nose.readthedocs.io/en/latest/, nose is in maintenance mode.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-3713) Consider moving away from nose to nose2 or pytest.

2019-10-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-3713?focusedWorklogId=332768=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332768
 ]

ASF GitHub Bot logged work on BEAM-3713:


Author: ASF GitHub Bot
Created on: 23/Oct/19 18:29
Start Date: 23/Oct/19 18:29
Worklog Time Spent: 10m 
  Work Description: tvalentyn commented on pull request #9756: [BEAM-3713] 
Add pytest for unit tests
URL: https://github.com/apache/beam/pull/9756#discussion_r338208964
 
 

 ##
 File path: sdks/python/apache_beam/runners/dataflow/dataflow_runner_test.py
 ##
 @@ -228,6 +229,8 @@ def test_biqquery_read_streaming_fail(self):
  r'source is not currently available'):
   p.run()
 
+  @pytest.mark.skipif(sys.version_info >= (3, 7),
+  reason='TODO(BEAM-8095): Segfaults in Python 3.7')
 
 Review comment:
   Regarding problems with this test see: 
https://issues.apache.org/jira/browse/BEAM-8397?focusedCommentId=16958087=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16958087
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 332768)
Time Spent: 10.5h  (was: 10h 20m)

> Consider moving away from nose to nose2 or pytest.
> --
>
> Key: BEAM-3713
> URL: https://issues.apache.org/jira/browse/BEAM-3713
> Project: Beam
>  Issue Type: Test
>  Components: sdk-py-core, testing
>Reporter: Robert Bradshaw
>Assignee: Udi Meiri
>Priority: Minor
>  Time Spent: 10.5h
>  Remaining Estimate: 0h
>
> Per https://nose.readthedocs.io/en/latest/, nose is in maintenance mode.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-8457) Instrument Dataflow jobs that are launched from Notebooks

2019-10-23 Thread Pablo Estrada (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958147#comment-16958147
 ] 

Pablo Estrada commented on BEAM-8457:
-

yes, this change is for the current release.

> Instrument Dataflow jobs that are launched from Notebooks
> -
>
> Key: BEAM-8457
> URL: https://issues.apache.org/jira/browse/BEAM-8457
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-py-interactive
>Reporter: Ning Kang
>Assignee: Ning Kang
>Priority: Major
> Fix For: 2.17.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Dataflow needs the capability to tell how many Dataflow jobs are launched 
> from the Notebook environment, i.e., the Interactive Runner.
>  # Change the pipeline.run() API to allow supplying a runner and an option 
> parameter so that a pipeline initially bundled w/ an interactive runner can 
> be directly run by other runners from a notebook.
>  # Implicitly add the necessary source information through user labels when 
> the user does p.run(runner=DataflowRunner()).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (BEAM-8367) Python BigQuery sink should use unique IDs for mode STREAMING_INSERTS

2019-10-23 Thread Pablo Estrada (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pablo Estrada resolved BEAM-8367.
-
Resolution: Fixed

> Python BigQuery sink should use unique IDs for mode STREAMING_INSERTS
> -
>
> Key: BEAM-8367
> URL: https://issues.apache.org/jira/browse/BEAM-8367
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Chamikara Madhusanka Jayalath
>Assignee: Pablo Estrada
>Priority: Blocker
> Fix For: 2.17.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> Unique IDs ensure (best effort) that writes to BigQuery are idempotent; for 
> example, we don't write the same record twice after a VM failure.
>  
> Currently the Python BQ sink inserts BQ IDs here, but they'll be regenerated 
> after a VM failure, resulting in data duplication.
> [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py#L766]
>  
> The correct fix is to do a Reshuffle to checkpoint unique IDs once they are 
> generated, similar to how the Java BQ sink operates.
> [https://github.com/apache/beam/blob/dcf6ad301069e4d2cfaec5db6b178acb7bb67f49/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/StreamingWriteTables.java#L225]
>  
> Pablo, can you do an initial assessment here ?
> I think this is a relatively small fix but I might be wrong.
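The fix described in the issue can be illustrated with a stdlib-only sketch (function names are illustrative, not Beam or BigQuery APIs): the insert id is generated exactly once per row and, once checkpointed, a retried bundle re-sends the same ids, so the service can drop the duplicates.

```python
import uuid

def tag_with_insert_ids(rows):
    # Generate the insert id exactly once per row; checkpointing (the role the
    # Reshuffle plays in the proposed fix) is what lets a retry reuse the ids.
    return [(str(uuid.uuid4()), row) for row in rows]

def streaming_insert(tagged_rows, seen_ids, table):
    # Best-effort service-side dedup: rows whose insert id was already seen
    # (i.e. a retried bundle) are dropped, making the write idempotent.
    for insert_id, row in tagged_rows:
        if insert_id not in seen_ids:
            seen_ids.add(insert_id)
            table.append(row)
```

Replaying the same tagged bundle a second time leaves the table unchanged, which is exactly the property lost when ids are regenerated after a VM failure.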



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-8367) Python BigQuery sink should use unique IDs for mode STREAMING_INSERTS

2019-10-23 Thread Pablo Estrada (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-8367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958146#comment-16958146
 ] 

Pablo Estrada commented on BEAM-8367:
-

Yes. Fixed. Thanks!

> Python BigQuery sink should use unique IDs for mode STREAMING_INSERTS
> -
>
> Key: BEAM-8367
> URL: https://issues.apache.org/jira/browse/BEAM-8367
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Chamikara Madhusanka Jayalath
>Assignee: Pablo Estrada
>Priority: Blocker
> Fix For: 2.17.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> Unique IDs ensure (best effort) that writes to BigQuery are idempotent; for 
> example, we don't write the same record twice after a VM failure.
>  
> Currently the Python BQ sink inserts BQ IDs here, but they'll be regenerated 
> after a VM failure, resulting in data duplication.
> [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py#L766]
>  
> The correct fix is to do a Reshuffle to checkpoint unique IDs once they are 
> generated, similar to how the Java BQ sink operates.
> [https://github.com/apache/beam/blob/dcf6ad301069e4d2cfaec5db6b178acb7bb67f49/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/StreamingWriteTables.java#L225]
>  
> Pablo, can you do an initial assessment here ?
> I think this is a relatively small fix but I might be wrong.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

