[jira] [Work logged] (BEAM-8344) Add infer schema support in ParquetIO and refactor ParquetTableProvider
[ https://issues.apache.org/jira/browse/BEAM-8344?focusedWorklogId=333070&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-333070 ]

ASF GitHub Bot logged work on BEAM-8344:

Author: ASF GitHub Bot
Created on: 24/Oct/19 04:31
Start Date: 24/Oct/19 04:31
Worklog Time Spent: 10m

Work Description: bmv126 commented on issue #9721: [BEAM-8344] Add inferSchema support in ParquetIO and refactor ParquetTableProvider
URL: https://github.com/apache/beam/pull/9721#issuecomment-545739303

R: @amaliujia
R: @reuvenlax
R: @jbonofre

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Issue Time Tracking
---

Worklog Id: (was: 333070)
Time Spent: 1h (was: 50m)

> Add infer schema support in ParquetIO and refactor ParquetTableProvider
> ---
>
> Key: BEAM-8344
> URL: https://issues.apache.org/jira/browse/BEAM-8344
> Project: Beam
> Issue Type: Improvement
> Components: dsl-sql, io-java-parquet
> Reporter: Vishwas
> Assignee: Vishwas
> Priority: Major
> Time Spent: 1h
> Remaining Estimate: 0h
>
> Add support for inferring Beam Schema in ParquetIO.
> Refactor ParquetTable code to use Convert.rows().
> Remove unnecessary java class GenericRecordReadConverter.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Work logged] (BEAM-7746) Add type hints to python code
[ https://issues.apache.org/jira/browse/BEAM-7746?focusedWorklogId=333061&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-333061 ]

ASF GitHub Bot logged work on BEAM-7746:

Author: ASF GitHub Bot
Created on: 24/Oct/19 04:14
Start Date: 24/Oct/19 04:14
Worklog Time Spent: 10m

Work Description: chadrik commented on issue #9056: [BEAM-7746] Add python type hints
URL: https://github.com/apache/beam/pull/9056#issuecomment-545735730

Ok, I have what I think are satisfactory answers to all of the problems that I encountered, except for the 4 issues below that I need input on. @robertwb, can you help me find the answers to these, please?

---

```
apache_beam/io/iobase.py:924: error: "SourceBase" has no attribute "coder" [attr-defined]
```

I can't find any subclasses of `SourceBase` that have a coder attribute. Is this safe to remove, or is this a Dataflow thing?

---

```
apache_beam/runners/worker/statesampler_slow.py:77: error: "StateSampler" has no attribute "_states_by_name" [attr-defined]
```

`statesampler_slow.StateSampler` does not have a `_states_by_name` attribute, but `statesampler.StateSampler` does. I could add this attribute to `statesampler_slow.StateSampler`, but I don't think it would be used. The more straightforward solution may be to edit `statesampler_slow.StateSampler.reset()` to do nothing. Right now I think it would error if it ran.

---

```
apache_beam/runners/portability/fn_api_runner_transforms.py:280: error: Invalid index type "Optional[str]" for "MutableMapping[str, Environment]"; expected type "str" [index]
```

It's unclear to me whether `Stage.environment` is meant to be `str` or `Optional[str]`.
The way that it's initialized, it _could_ be `Optional[str]`:

```python
class Stage(object):
  """A set of Transforms that can be sent to the worker for processing."""
  def __init__(self,
               name,  # type: str
               transforms,  # type: List[beam_runner_api_pb2.PTransform]
               downstream_side_inputs=None,  # type: Optional[FrozenSet[str]]
               must_follow=frozenset(),  # type: FrozenSet[Stage]
               parent=None,  # type: Optional[Stage]
               environment=None,  # type: Optional[str]
               forced_root=False
               ):
    ...
    if environment is None:
      environment = functools.reduce(
          self._merge_environments,
          (self._extract_environment(t) for t in transforms))
    self.environment = environment

  ...

  @staticmethod
  def _extract_environment(transform):
    # type: (beam_runner_api_pb2.PTransform) -> Optional[str]
    if transform.spec.urn in PAR_DO_URNS:
      pardo_payload = proto_utils.parse_Bytes(
          transform.spec.payload, beam_runner_api_pb2.ParDoPayload)
      return pardo_payload.do_fn.environment_id
    elif transform.spec.urn in COMBINE_URNS:
      combine_payload = proto_utils.parse_Bytes(
          transform.spec.payload, beam_runner_api_pb2.CombinePayload)
      return combine_payload.combine_fn.environment_id
    else:
      return None
```

In practice, will there always be at least one ParDo or Combine per stage? If so, we should be asserting that `self.environment is not None` in `Stage.__init__`. Alternately, we could assert that `self.environment is not None` just before this call in `executable_stage_transform`. The bottom line is that there are currently no guarantees in the code that `self.environment` is not None at this point, and if it is, it will be an error.

---

```
apache_beam/runners/portability/fn_api_runner.py:933: error: "ParallelBundleManager" has no attribute "_skip_registration" [attr-defined]
```

I can't find anywhere in the code that refers to `_skip_registration`. Is this safe to remove?
Issue Time Tracking
---

Worklog Id: (was: 333061)
Time Spent: 9.5h (was: 9h 20m)

> Add type hints to python code
> -
>
> Key: BEAM-7746
> URL: https://issues.apache.org/jira/browse/BEAM-7746
> Project: Beam
> Issue Type: New Feature
> Components: sdk-py-core
> Reporter: Chad Dombrova
> Assignee: Chad Dombrova
> Priority: Major
> Time Spent: 9.5h
> Remaining Estimate: 0h
>
> As a
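The `Optional[str]` question above comes down to convincing mypy that the environment has been resolved before it is used to index a `MutableMapping[str, Environment]`. A minimal sketch of the assert-based narrowing being proposed (the function name `resolve_environment` is hypothetical, not a Beam API):

```python
from typing import Dict, Optional


def resolve_environment(environment: Optional[str],
                        environments: Dict[str, str]) -> str:
    # After this assert, mypy narrows `environment` from Optional[str]
    # to str, so indexing a Dict keyed by str type-checks cleanly.
    assert environment is not None, 'stage has no environment'
    return environments[environment]
```

The same narrowing works on an attribute: asserting `self.environment is not None` in `Stage.__init__` would both document the invariant and satisfy the type checker at the later indexing site.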
[jira] [Work logged] (BEAM-7926) Visualize PCollection with Interactive Beam
[ https://issues.apache.org/jira/browse/BEAM-7926?focusedWorklogId=333026&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-333026 ]

ASF GitHub Bot logged work on BEAM-7926:

Author: ASF GitHub Bot
Created on: 24/Oct/19 03:24
Start Date: 24/Oct/19 03:24
Worklog Time Spent: 10m

Work Description: pabloem commented on pull request #9771: [BEAM-7926] Update dependencies in Java Katas
URL: https://github.com/apache/beam/pull/9771

Issue Time Tracking
---

Worklog Id: (was: 333026)
Time Spent: 14h 10m (was: 14h)

> Visualize PCollection with Interactive Beam
> ---
>
> Key: BEAM-7926
> URL: https://issues.apache.org/jira/browse/BEAM-7926
> Project: Beam
> Issue Type: New Feature
> Components: runner-py-interactive
> Reporter: Ning Kang
> Assignee: Ning Kang
> Priority: Major
> Time Spent: 14h 10m
> Remaining Estimate: 0h
>
> Support auto plotting / charting of materialized data of a given PCollection with Interactive Beam.
> Say an Interactive Beam pipeline is defined as
> p = create_pipeline()
> pcoll = p | 'Transform' >> transform()
> The user can call a single function and get auto-magical charting of the data as materialized pcoll.
> e.g., visualize(pcoll)
[jira] [Work logged] (BEAM-7926) Visualize PCollection with Interactive Beam
[ https://issues.apache.org/jira/browse/BEAM-7926?focusedWorklogId=333027&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-333027 ]

ASF GitHub Bot logged work on BEAM-7926:

Author: ASF GitHub Bot
Created on: 24/Oct/19 03:24
Start Date: 24/Oct/19 03:24
Worklog Time Spent: 10m

Work Description: pabloem commented on issue #9771: [BEAM-7926] Update dependencies in Java Katas
URL: https://github.com/apache/beam/pull/9771#issuecomment-545725673

Thanks for the review @henryken, and thanks a lot for the PR @leonardoam!

Issue Time Tracking
---

Worklog Id: (was: 333027)
Time Spent: 14h 20m (was: 14h 10m)

> Visualize PCollection with Interactive Beam
> ---
>
> Key: BEAM-7926
> URL: https://issues.apache.org/jira/browse/BEAM-7926
> Project: Beam
> Issue Type: New Feature
> Components: runner-py-interactive
> Reporter: Ning Kang
> Assignee: Ning Kang
> Priority: Major
> Time Spent: 14h 20m
> Remaining Estimate: 0h
>
> Support auto plotting / charting of materialized data of a given PCollection with Interactive Beam.
> Say an Interactive Beam pipeline is defined as
> p = create_pipeline()
> pcoll = p | 'Transform' >> transform()
> The user can call a single function and get auto-magical charting of the data as materialized pcoll.
> e.g., visualize(pcoll)
[jira] [Work logged] (BEAM-7926) Visualize PCollection with Interactive Beam
[ https://issues.apache.org/jira/browse/BEAM-7926?focusedWorklogId=333023&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-333023 ]

ASF GitHub Bot logged work on BEAM-7926:

Author: ASF GitHub Bot
Created on: 24/Oct/19 03:19
Start Date: 24/Oct/19 03:19
Worklog Time Spent: 10m

Work Description: henryken commented on issue #9771: [BEAM-7926] Update dependencies in Java Katas
URL: https://github.com/apache/beam/pull/9771#issuecomment-545724583

LGTM. I ran a quick test and nothing broke.

Issue Time Tracking
---

Worklog Id: (was: 333023)
Time Spent: 14h (was: 13h 50m)

> Visualize PCollection with Interactive Beam
> ---
>
> Key: BEAM-7926
> URL: https://issues.apache.org/jira/browse/BEAM-7926
> Project: Beam
> Issue Type: New Feature
> Components: runner-py-interactive
> Reporter: Ning Kang
> Assignee: Ning Kang
> Priority: Major
> Time Spent: 14h (was: 13h 50m)
> Remaining Estimate: 0h
>
> Support auto plotting / charting of materialized data of a given PCollection with Interactive Beam.
> Say an Interactive Beam pipeline is defined as
> p = create_pipeline()
> pcoll = p | 'Transform' >> transform()
> The user can call a single function and get auto-magical charting of the data as materialized pcoll.
> e.g., visualize(pcoll)
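The `visualize(pcoll)` behavior described in the BEAM-7926 ticket can be approximated outside Beam as a summary over already-materialized data. A dependency-free sketch under that assumption (`visualize` here is a hypothetical stand-in, not the Interactive Beam API, and it returns value counts rather than rendering a chart):

```python
from collections import Counter


def visualize(materialized):
    # Stand-in for auto-charting: reduce the materialized PCollection
    # contents to value counts, the kind of summary a bar chart would show.
    counts = Counter(materialized)
    return dict(counts.most_common())
```

In the real feature, the materialization step is what Interactive Beam provides: it runs the pipeline fragment that produces `pcoll` and feeds the collected elements to the plotting layer.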
[jira] [Work logged] (BEAM-8446) apache_beam.io.gcp.bigquery_write_it_test.BigQueryWriteIntegrationTests.test_big_query_write_new_types is flaky
[ https://issues.apache.org/jira/browse/BEAM-8446?focusedWorklogId=333022&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-333022 ]

ASF GitHub Bot logged work on BEAM-8446:

Author: ASF GitHub Bot
Created on: 24/Oct/19 03:16
Start Date: 24/Oct/19 03:16
Worklog Time Spent: 10m

Work Description: pabloem commented on issue #9855: [BEAM-8446] Retrying BQ query on timeouts
URL: https://github.com/apache/beam/pull/9855#issuecomment-545724097

Hm, I added it because the pydoc specifies it is a possibility. I did not reproduce it because the job is flaky...
https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.job.QueryJob.html#google.cloud.bigquery.job.QueryJob.result

Issue Time Tracking
---

Worklog Id: (was: 333022)
Time Spent: 3h 40m (was: 3.5h)

> apache_beam.io.gcp.bigquery_write_it_test.BigQueryWriteIntegrationTests.test_big_query_write_new_types is flaky
> ---
>
> Key: BEAM-8446
> URL: https://issues.apache.org/jira/browse/BEAM-8446
> Project: Beam
> Issue Type: New Feature
> Components: test-failures
> Reporter: Boyuan Zhang
> Assignee: Pablo Estrada
> Priority: Major
> Time Spent: 3h 40m
> Remaining Estimate: 0h
>
> test_big_query_write_new_types appears to be flaky in the beam_PostCommit_Python37 test suite.
> https://builds.apache.org/job/beam_PostCommit_Python37/733/
> https://builds.apache.org/job/beam_PostCommit_Python37/739/
[jira] [Work logged] (BEAM-8388) Update Avro to 1.9.1 from 1.8.2
[ https://issues.apache.org/jira/browse/BEAM-8388?focusedWorklogId=333007&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-333007 ]

ASF GitHub Bot logged work on BEAM-8388:

Author: ASF GitHub Bot
Created on: 24/Oct/19 02:33
Start Date: 24/Oct/19 02:33
Worklog Time Spent: 10m

Work Description: kennknowles commented on issue #9779: [BEAM-8388] Updating avro dependency from 1.8.2 to 1.9.1
URL: https://github.com/apache/beam/pull/9779#issuecomment-545715194

Seems like including Avro in the core SDK might be a mistake / legacy problem. We could potentially create new artifacts for particular versions so users can choose compatible pieces of Beam.

Issue Time Tracking
---

Worklog Id: (was: 333007)
Remaining Estimate: 21h 20m (was: 21.5h)
Time Spent: 2h 40m (was: 2.5h)

> Update Avro to 1.9.1 from 1.8.2
> ---
>
> Key: BEAM-8388
> URL: https://issues.apache.org/jira/browse/BEAM-8388
> Project: Beam
> Issue Type: Improvement
> Components: io-java-avro
> Reporter: Jordanna Chord
> Assignee: Jordanna Chord
> Priority: Major
> Original Estimate: 24h
> Time Spent: 2h 40m
> Remaining Estimate: 21h 20m
>
> Update build dependency to 1.9.1
[jira] [Work logged] (BEAM-8446) apache_beam.io.gcp.bigquery_write_it_test.BigQueryWriteIntegrationTests.test_big_query_write_new_types is flaky
[ https://issues.apache.org/jira/browse/BEAM-8446?focusedWorklogId=332997&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332997 ]

ASF GitHub Bot logged work on BEAM-8446:

Author: ASF GitHub Bot
Created on: 24/Oct/19 01:26
Start Date: 24/Oct/19 01:26
Worklog Time Spent: 10m

Work Description: udim commented on issue #9855: [BEAM-8446] Retrying BQ query on timeouts
URL: https://github.com/apache/beam/pull/9855#issuecomment-545701377

> Again timeout without failures: https://builds.apache.org/job/beam_PostCommit_Python37_PR/46/

I've tried looking at these with no luck. :(

Issue Time Tracking
---

Worklog Id: (was: 332997)
Time Spent: 3.5h (was: 3h 20m)

> apache_beam.io.gcp.bigquery_write_it_test.BigQueryWriteIntegrationTests.test_big_query_write_new_types is flaky
> ---
>
> Key: BEAM-8446
> URL: https://issues.apache.org/jira/browse/BEAM-8446
> Project: Beam
> Issue Type: New Feature
> Components: test-failures
> Reporter: Boyuan Zhang
> Assignee: Pablo Estrada
> Priority: Major
> Time Spent: 3.5h
> Remaining Estimate: 0h
>
> test_big_query_write_new_types appears to be flaky in the beam_PostCommit_Python37 test suite.
> https://builds.apache.org/job/beam_PostCommit_Python37/733/
> https://builds.apache.org/job/beam_PostCommit_Python37/739/
[jira] [Work logged] (BEAM-8446) apache_beam.io.gcp.bigquery_write_it_test.BigQueryWriteIntegrationTests.test_big_query_write_new_types is flaky
[ https://issues.apache.org/jira/browse/BEAM-8446?focusedWorklogId=332996&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332996 ]

ASF GitHub Bot logged work on BEAM-8446:

Author: ASF GitHub Bot
Created on: 24/Oct/19 01:23
Start Date: 24/Oct/19 01:23
Worklog Time Spent: 10m

Work Description: udim commented on pull request #9855: [BEAM-8446] Retrying BQ query on timeouts
URL: https://github.com/apache/beam/pull/9855#discussion_r338345852

## File path: sdks/python/apache_beam/io/gcp/tests/bigquery_matcher.py

@@ -46,9 +47,11 @@
 MAX_RETRIES = 5


-def retry_on_http_and_value_error(exception):
+def retry_on_http_timeout_and_value_error(exception):
   """Filter allowing retries on Bigquery errors and value error."""
-  return isinstance(exception, (GoogleCloudError, ValueError))
+  return isinstance(exception, (GoogleCloudError,
+                                ValueError,
+                                concurrent.futures.TimeoutError))

Review comment: Where is this `TimeoutError` being raised? I didn't see it in the failed tests' logs.

Issue Time Tracking
---

Worklog Id: (was: 332996)
Time Spent: 3h 20m (was: 3h 10m)

> apache_beam.io.gcp.bigquery_write_it_test.BigQueryWriteIntegrationTests.test_big_query_write_new_types is flaky
> ---
>
> Key: BEAM-8446
> URL: https://issues.apache.org/jira/browse/BEAM-8446
> Project: Beam
> Issue Type: New Feature
> Components: test-failures
> Reporter: Boyuan Zhang
> Assignee: Pablo Estrada
> Priority: Major
> Time Spent: 3h 20m
> Remaining Estimate: 0h
>
> test_big_query_write_new_types appears to be flaky in the beam_PostCommit_Python37 test suite.
> https://builds.apache.org/job/beam_PostCommit_Python37/733/
> https://builds.apache.org/job/beam_PostCommit_Python37/739/
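The predicate in the diff above only decides whether an exception is retryable; the retry loop itself lives elsewhere. A self-contained sketch of the pattern, with `GoogleCloudError` left out so it runs without GCP dependencies (`call_with_retries` is illustrative, not the retry decorator Beam actually uses):

```python
import concurrent.futures


def retry_on_timeout_and_value_error(exception):
    # Mirrors the filter in the diff: retry on ValueError and on the
    # concurrent.futures.TimeoutError that QueryJob.result(timeout=...)
    # can raise per its pydoc.
    return isinstance(exception, (ValueError, concurrent.futures.TimeoutError))


def call_with_retries(fn, max_retries=5):
    # Re-invoke fn until it succeeds, the exception is non-retryable,
    # or the retry budget is exhausted (last exception propagates).
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception as exc:
            if attempt == max_retries - 1 or not retry_on_timeout_and_value_error(exc):
                raise
```

Keeping the predicate separate from the loop is what makes the one-line diff above possible: widening the retried exception set never touches the retry mechanics.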
[jira] [Work logged] (BEAM-8446) apache_beam.io.gcp.bigquery_write_it_test.BigQueryWriteIntegrationTests.test_big_query_write_new_types is flaky
[ https://issues.apache.org/jira/browse/BEAM-8446?focusedWorklogId=332994&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332994 ]

ASF GitHub Bot logged work on BEAM-8446:

Author: ASF GitHub Bot
Created on: 24/Oct/19 01:21
Start Date: 24/Oct/19 01:21
Worklog Time Spent: 10m

Work Description: pabloem commented on issue #9855: [BEAM-8446] Retrying BQ query on timeouts
URL: https://github.com/apache/beam/pull/9855#issuecomment-545700463

Again timeout without failures: https://builds.apache.org/job/beam_PostCommit_Python37_PR/46/

Issue Time Tracking
---

Worklog Id: (was: 332994)
Time Spent: 3h 10m (was: 3h)

> apache_beam.io.gcp.bigquery_write_it_test.BigQueryWriteIntegrationTests.test_big_query_write_new_types is flaky
> ---
>
> Key: BEAM-8446
> URL: https://issues.apache.org/jira/browse/BEAM-8446
> Project: Beam
> Issue Type: New Feature
> Components: test-failures
> Reporter: Boyuan Zhang
> Assignee: Pablo Estrada
> Priority: Major
> Time Spent: 3h 10m
> Remaining Estimate: 0h
>
> test_big_query_write_new_types appears to be flaky in the beam_PostCommit_Python37 test suite.
> https://builds.apache.org/job/beam_PostCommit_Python37/733/
> https://builds.apache.org/job/beam_PostCommit_Python37/739/
[jira] [Work logged] (BEAM-8446) apache_beam.io.gcp.bigquery_write_it_test.BigQueryWriteIntegrationTests.test_big_query_write_new_types is flaky
[ https://issues.apache.org/jira/browse/BEAM-8446?focusedWorklogId=332993&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332993 ]

ASF GitHub Bot logged work on BEAM-8446:

Author: ASF GitHub Bot
Created on: 24/Oct/19 01:20
Start Date: 24/Oct/19 01:20
Worklog Time Spent: 10m

Work Description: pabloem commented on issue #9855: [BEAM-8446] Retrying BQ query on timeouts
URL: https://github.com/apache/beam/pull/9855#issuecomment-545700383

Run Python 3.7 PostCommit

Issue Time Tracking
---

Worklog Id: (was: 332993)
Time Spent: 3h (was: 2h 50m)

> apache_beam.io.gcp.bigquery_write_it_test.BigQueryWriteIntegrationTests.test_big_query_write_new_types is flaky
> ---
>
> Key: BEAM-8446
> URL: https://issues.apache.org/jira/browse/BEAM-8446
> Project: Beam
> Issue Type: New Feature
> Components: test-failures
> Reporter: Boyuan Zhang
> Assignee: Pablo Estrada
> Priority: Major
> Time Spent: 3h
> Remaining Estimate: 0h
>
> test_big_query_write_new_types appears to be flaky in the beam_PostCommit_Python37 test suite.
> https://builds.apache.org/job/beam_PostCommit_Python37/733/
> https://builds.apache.org/job/beam_PostCommit_Python37/739/
[jira] [Work logged] (BEAM-8146) SchemaCoder/RowCoder have no equals() function
[ https://issues.apache.org/jira/browse/BEAM-8146?focusedWorklogId=332987&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332987 ]

ASF GitHub Bot logged work on BEAM-8146:

Author: ASF GitHub Bot
Created on: 24/Oct/19 01:09
Start Date: 24/Oct/19 01:09
Worklog Time Spent: 10m

Work Description: kennknowles commented on issue #9493: [BEAM-8146,BEAM-8204,BEAM-8205] Add equals and hashCode to SchemaCoder and RowCoder
URL: https://github.com/apache/beam/pull/9493#issuecomment-545698025

I would say keep as-is. Being able to implement `getEncodedTypeDescriptor` adds value on its own.

Issue Time Tracking
---

Worklog Id: (was: 332987)
Time Spent: 3h (was: 2h 50m)

> SchemaCoder/RowCoder have no equals() function
> --
>
> Key: BEAM-8146
> URL: https://issues.apache.org/jira/browse/BEAM-8146
> Project: Beam
> Issue Type: Bug
> Components: sdk-java-core
> Affects Versions: 2.15.0
> Reporter: Brian Hulette
> Assignee: Brian Hulette
> Priority: Major
> Time Spent: 3h
> Remaining Estimate: 0h
>
> SchemaCoder has no equals function, so it can't be compared in tests, like CloudComponentsTests$DefaultCoders, which is being re-enabled in https://github.com/apache/beam/pull/9446
[jira] [Work logged] (BEAM-8146) SchemaCoder/RowCoder have no equals() function
[ https://issues.apache.org/jira/browse/BEAM-8146?focusedWorklogId=332989&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332989 ]

ASF GitHub Bot logged work on BEAM-8146:

Author: ASF GitHub Bot
Created on: 24/Oct/19 01:09
Start Date: 24/Oct/19 01:09
Worklog Time Spent: 10m

Work Description: kennknowles commented on issue #9493: [BEAM-8146,BEAM-8204,BEAM-8205] Add equals and hashCode to SchemaCoder and RowCoder
URL: https://github.com/apache/beam/pull/9493#issuecomment-545698141

Given how long you've been waiting on this, I'm inclined to merge. Last chance to comment, @reuvenlax?

Issue Time Tracking
---

Worklog Id: (was: 332989)
Time Spent: 3h 10m (was: 3h)

> SchemaCoder/RowCoder have no equals() function
> --
>
> Key: BEAM-8146
> URL: https://issues.apache.org/jira/browse/BEAM-8146
> Project: Beam
> Issue Type: Bug
> Components: sdk-java-core
> Affects Versions: 2.15.0
> Reporter: Brian Hulette
> Assignee: Brian Hulette
> Priority: Major
> Time Spent: 3h 10m
> Remaining Estimate: 0h
>
> SchemaCoder has no equals function, so it can't be compared in tests, like CloudComponentsTests$DefaultCoders, which is being re-enabled in https://github.com/apache/beam/pull/9446
[jira] [Work logged] (BEAM-7981) ParDo function wrapper doesn't support Iterable output types
[ https://issues.apache.org/jira/browse/BEAM-7981?focusedWorklogId=332986&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332986 ]

ASF GitHub Bot logged work on BEAM-7981:

Author: ASF GitHub Bot
Created on: 24/Oct/19 00:49
Start Date: 24/Oct/19 00:49
Worklog Time Spent: 10m

Work Description: udim commented on pull request #9708: [BEAM-7981] Fix double iterable stripping
URL: https://github.com/apache/beam/pull/9708

Issue Time Tracking
---

Worklog Id: (was: 332986)
Time Spent: 2h 10m (was: 2h)

> ParDo function wrapper doesn't support Iterable output types
>
> Key: BEAM-7981
> URL: https://issues.apache.org/jira/browse/BEAM-7981
> Project: Beam
> Issue Type: Bug
> Components: sdk-py-core
> Reporter: Udi Meiri
> Assignee: Udi Meiri
> Priority: Major
> Time Spent: 2h 10m
> Remaining Estimate: 0h
>
> I believe the bug is in CallableWrapperDoFn.default_type_hints, which converts Iterable[str] to str.
> This test will be included (commented out) in https://github.com/apache/beam/pull/9283
> {code}
> def test_typed_callable_iterable_output(self):
>   @typehints.with_input_types(int)
>   @typehints.with_output_types(typehints.Iterable[str])
>   def do_fn(element):
>     return [[str(element)] * 2]
>   result = [1, 2] | beam.ParDo(do_fn)
>   self.assertEqual([['1', '1'], ['2', '2']], sorted(result))
> {code}
> Result:
> {code}
> ==
> ERROR: test_typed_callable_iterable_output (apache_beam.typehints.typed_pipeline_test.MainInputTest)
> --
> Traceback (most recent call last):
>   File "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/typehints/typed_pipeline_test.py", line 104, in test_typed_callable_iterable_output
>     result = [1, 2] | beam.ParDo(do_fn)
>   File "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/transforms/ptransform.py", line 519, in __ror__
>     p.run().wait_until_finish()
>   File "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/pipeline.py", line 406, in run
>     self._options).run(False)
>   File "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/pipeline.py", line 419, in run
>     return self.runner.run_pipeline(self, self._options)
>   File "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/direct/direct_runner.py", line 129, in run_pipeline
>     return runner.run_pipeline(pipeline, options)
>   File "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/portability/fn_api_runner.py", line 366, in run_pipeline
>     default_environment=self._default_environment))
>   File "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/portability/fn_api_runner.py", line 373, in run_via_runner_api
>     return self.run_stages(stage_context, stages)
>   File "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/portability/fn_api_runner.py", line 455, in run_stages
>     stage_context.safe_coders)
>   File "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/portability/fn_api_runner.py", line 733, in _run_stage
>     result, splits = bundle_manager.process_bundle(data_input, data_output)
>   File "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/portability/fn_api_runner.py", line 1663, in process_bundle
>     part, expected_outputs), part_inputs):
>   File "/usr/lib/python3.7/concurrent/futures/_base.py", line 586, in result_iterator
>     yield fs.pop().result()
>   File "/usr/lib/python3.7/concurrent/futures/_base.py", line 432, in result
>     return self.__get_result()
>   File "/usr/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
>     raise self._exception
>   File "/usr/lib/python3.7/concurrent/futures/thread.py", line 57, in run
>     result = self.fn(*self.args, **self.kwargs)
>   File "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/portability/fn_api_runner.py", line 1663, in
>     part, expected_outputs), part_inputs):
>   File "/usr/local/google/home/ehudm/src/beam/sdks/python/apache_beam/runners/portability/fn_api_runner.py", line 1601, in process_bundle
>     result_future =
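The "double iterable stripping" in the PR title refers to ParDo unwrapping the element type from a callable's declared output hint. A sketch of that unwrapping under Python's `typing` module, showing how applying it twice loses a level of nesting (`strip_iterable` is illustrative, not Beam's actual helper):

```python
import collections.abc
from typing import Iterable, get_args, get_origin


def strip_iterable(hint):
    # ParDo treats a declared output of Iterable[X] as elements of type X,
    # so the element type is extracted from the iterable hint.
    if get_origin(hint) in (collections.abc.Iterable, list):
        return get_args(hint)[0]
    return hint
```

Stripping `Iterable[Iterable[str]]` once yields `Iterable[str]` (the per-element type declared in the test above); stripping it a second time collapses it to `str`, which is the mismatch the fix addresses.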
[jira] [Work logged] (BEAM-7886) Make row coder a standard coder and implement in python
[ https://issues.apache.org/jira/browse/BEAM-7886?focusedWorklogId=332985&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332985 ]

ASF GitHub Bot logged work on BEAM-7886:

Author: ASF GitHub Bot
Created on: 24/Oct/19 00:46
Start Date: 24/Oct/19 00:46
Worklog Time Spent: 10m

Work Description: TheNeuralBit commented on issue #9188: [BEAM-7886] Make row coder a standard coder and implement in Python
URL: https://github.com/apache/beam/pull/9188#issuecomment-545693677

Run Python PreCommit

Issue Time Tracking
---

Worklog Id: (was: 332985)
Time Spent: 14h 10m (was: 14h)

> Make row coder a standard coder and implement in python
> ---
>
> Key: BEAM-7886
> URL: https://issues.apache.org/jira/browse/BEAM-7886
> Project: Beam
> Issue Type: Improvement
> Components: beam-model, sdk-java-core, sdk-py-core
> Reporter: Brian Hulette
> Assignee: Brian Hulette
> Priority: Major
> Time Spent: 14h 10m
> Remaining Estimate: 0h
[jira] [Work logged] (BEAM-8451) Interactive Beam example failing from stack overflow
[ https://issues.apache.org/jira/browse/BEAM-8451?focusedWorklogId=332983&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332983 ]

ASF GitHub Bot logged work on BEAM-8451:

Author: ASF GitHub Bot
Created on: 24/Oct/19 00:35
Start Date: 24/Oct/19 00:35
Worklog Time Spent: 10m

Work Description: aaltay commented on issue #9865: [BEAM-8451] Fix interactive beam max recursion err
URL: https://github.com/apache/beam/pull/9865#issuecomment-545691833

R: @KevinGG

Issue Time Tracking
---

Worklog Id: (was: 332983)
Time Spent: 0.5h (was: 20m)

> Interactive Beam example failing from stack overflow
>
> Key: BEAM-8451
> URL: https://issues.apache.org/jira/browse/BEAM-8451
> Project: Beam
> Issue Type: Bug
> Components: examples-python, runner-py-interactive
> Reporter: Igor Durovic
> Assignee: Igor Durovic
> Priority: Major
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> RecursionError: maximum recursion depth exceeded in __instancecheck__
> at [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/interactive/pipeline_analyzer.py#L405]
>
> This occurred after the execution of the last cell in [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/interactive/examples/Interactive%20Beam%20Example.ipynb]
[jira] [Work logged] (BEAM-8451) Interactive Beam example failing from stack overflow
[ https://issues.apache.org/jira/browse/BEAM-8451?focusedWorklogId=332980&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332980 ]

ASF GitHub Bot logged work on BEAM-8451:

Author: ASF GitHub Bot
Created on: 24/Oct/19 00:16
Start Date: 24/Oct/19 00:16
Worklog Time Spent: 10m

Work Description: chunyang commented on issue #9865: [BEAM-8451] Fix interactive beam max recursion err
URL: https://github.com/apache/beam/pull/9865#issuecomment-545688129

Run PythonLint PreCommit

Issue Time Tracking
---

Worklog Id: (was: 332980)
Time Spent: 20m (was: 10m)

> Interactive Beam example failing from stack overflow
>
> Key: BEAM-8451
> URL: https://issues.apache.org/jira/browse/BEAM-8451
> Project: Beam
> Issue Type: Bug
> Components: examples-python, runner-py-interactive
> Reporter: Igor Durovic
> Assignee: Igor Durovic
> Priority: Major
> Time Spent: 20m
> Remaining Estimate: 0h
>
> RecursionError: maximum recursion depth exceeded in __instancecheck__
> at [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/interactive/pipeline_analyzer.py#L405]
>
> This occurred after the execution of the last cell in [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/interactive/examples/Interactive%20Beam%20Example.ipynb]
[jira] [Work logged] (BEAM-8382) Add polling interval to KinesisIO.Read
[ https://issues.apache.org/jira/browse/BEAM-8382?focusedWorklogId=332974=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332974 ] ASF GitHub Bot logged work on BEAM-8382: Author: ASF GitHub Bot Created on: 24/Oct/19 00:07 Start Date: 24/Oct/19 00:07 Worklog Time Spent: 10m Work Description: jfarr commented on issue #9765: [BEAM-8382] Add polling interval to KinesisIO.Read URL: https://github.com/apache/beam/pull/9765#issuecomment-545686422 @aromanenko-dev I'm still testing and gathering data but one of my colleagues had an idea I wanted to run by you. What if we add a polling rate policy object and have fixed and fluent versions with fluent as the default? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 332974) Time Spent: 5h (was: 4h 50m) > Add polling interval to KinesisIO.Read > -- > > Key: BEAM-8382 > URL: https://issues.apache.org/jira/browse/BEAM-8382 > Project: Beam > Issue Type: Improvement > Components: io-java-kinesis >Affects Versions: 2.13.0, 2.14.0, 2.15.0 >Reporter: Jonothan Farr >Assignee: Jonothan Farr >Priority: Major > Time Spent: 5h > Remaining Estimate: 0h > > With the current implementation we are observing Kinesis throttling due to > ReadProvisionedThroughputExceeded on the order of hundreds of times per > second, regardless of the actual Kinesis throughput. This is because the > ShardReadersPool readLoop() method is polling getRecords() as fast as > possible. > From the KDS documentation: > {quote}Each shard can support up to five read transactions per second. > {quote} > and > {quote}For best results, sleep for at least 1 second (1,000 milliseconds) > between calls to getRecords to avoid exceeding the limit on getRecords > frequency. 
> {quote} > [https://docs.aws.amazon.com/streams/latest/dev/service-sizes-and-limits.html] > [https://docs.aws.amazon.com/streams/latest/dev/developing-consumers-with-sdk.html] -- This message was sent by Atlassian Jira (v8.3.4#803005)
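The "polling rate policy object" floated in the comment above, with fixed and adaptive ("fluent") variants, could be sketched roughly as follows. All names here (RateLimitPolicy, FixedDelayPolicy, AdaptiveDelayPolicy) and the specific delay bounds are illustrative assumptions, not part of KinesisIO's actual API:

```java
import java.time.Duration;

// Hypothetical sketch of a polling-rate policy for the getRecords read loop.
interface RateLimitPolicy {
  Duration delayBeforeNextPoll();

  void onSuccess(int recordCount);

  void onThrottled();
}

// Fixed variant: always sleep the same interval between getRecords calls,
// e.g. the 1s recommended by the KDS documentation quoted above.
class FixedDelayPolicy implements RateLimitPolicy {
  private final Duration delay;

  FixedDelayPolicy(Duration delay) {
    this.delay = delay;
  }

  @Override
  public Duration delayBeforeNextPoll() {
    return delay;
  }

  @Override
  public void onSuccess(int recordCount) {}

  @Override
  public void onThrottled() {}
}

// Adaptive variant: double the delay when throttled
// (ReadProvisionedThroughputExceeded), halve it again down to a floor
// after successful reads.
class AdaptiveDelayPolicy implements RateLimitPolicy {
  private static final Duration MIN = Duration.ofMillis(200);
  private static final Duration MAX = Duration.ofSeconds(5);
  private Duration delay = MIN;

  @Override
  public Duration delayBeforeNextPoll() {
    return delay;
  }

  @Override
  public void onSuccess(int recordCount) {
    Duration halved = delay.dividedBy(2);
    delay = halved.compareTo(MIN) < 0 ? MIN : halved;
  }

  @Override
  public void onThrottled() {
    Duration doubled = delay.multipliedBy(2);
    delay = doubled.compareTo(MAX) > 0 ? MAX : doubled;
  }
}
```

A shard read loop would then sleep for `delayBeforeNextPoll()` between getRecords calls and report outcomes back to the policy, letting the adaptive variant converge on a sustainable rate.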
[jira] [Commented] (BEAM-8397) DataflowRunnerTest.test_remote_runner_display_data fails due to infinite recursion during pickling.
[ https://issues.apache.org/jira/browse/BEAM-8397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958419#comment-16958419 ] Valentyn Tymofieiev commented on BEAM-8397: --- (Did not see an update from previous comment) > DataflowRunnerTest.test_remote_runner_display_data fails due to infinite > recursion during pickling. > --- > > Key: BEAM-8397 > URL: https://issues.apache.org/jira/browse/BEAM-8397 > Project: Beam > Issue Type: Sub-task > Components: sdk-py-core >Reporter: Valentyn Tymofieiev >Assignee: Valentyn Tymofieiev >Priority: Major > > `python ./setup.py test -s > apache_beam.runners.dataflow.dataflow_runner_test.DataflowRunnerTest.test_remote_runner_display_data` > passes. > `tox -e py37-gcp` passes if Beam depends on dill==0.3.0, but fails if Beam > depends on dill==0.3.1.1.`python ./setup.py nosetests --tests > 'apache_beam/runners/dataflow/dataflow_runner_test.py:DataflowRunnerTest.test_remote_runner_display_data` > fails currently if run on master. > The failure indicates infinite recursion during pickling: > {noformat} > test_remote_runner_display_data > (apache_beam.runners.dataflow.dataflow_runner_test.DataflowRunnerTest) ... > Fatal Python error: Cannot recover from stack overflow. 
> Current thread 0x7f9d700ed740 (most recent call first): > File "/usr/lib/python3.7/pickle.py", line 479 in get > File "/usr/lib/python3.7/pickle.py", line 497 in save > File "/usr/lib/python3.7/pickle.py", line 786 in save_tuple > File "/usr/lib/python3.7/pickle.py", line 504 in save > File "/usr/lib/python3.7/pickle.py", line 638 in save_reduce > File > "/usr/local/google/home/valentyn/tmp/py37env/lib/python3.7/site-packages/dill/_dill.py", > line 1394 in save_function > File "/usr/lib/python3.7/pickle.py", line 504 in save > File "/usr/lib/python3.7/pickle.py", line 882 in _batch_setitems > File "/usr/lib/python3.7/pickle.py", line 856 in save_dict > File > "/usr/local/google/home/valentyn/tmp/py37env/lib/python3.7/site-packages/dill/_dill.py", > line 910 in save_module_dict > File > "/usr/local/google/home/valentyn/projects/beam/clean/beam/sdks/python/apache_beam/internal/pickler.py", > line 198 in new_save_module_dict > File "/usr/lib/python3.7/pickle.py", line 504 in save > File "/usr/lib/python3.7/pickle.py", line 786 in save_tuple > File "/usr/lib/python3.7/pickle.py", line 504 in save > File "/usr/lib/python3.7/pickle.py", line 638 in save_reduce > File > "/usr/local/google/home/valentyn/projects/beam/clean/beam/sdks/python/apache_beam/internal/pickler.py", > line 114 in wrapper > File "/usr/lib/python3.7/pickle.py", line 504 in save > File "/usr/lib/python3.7/pickle.py", line 771 in save_tuple > File "/usr/lib/python3.7/pickle.py", line 504 in save > File "/usr/lib/python3.7/pickle.py", line 638 in save_reduce > File > "/usr/local/google/home/valentyn/tmp/py37env/lib/python3.7/site-packages/dill/_dill.py", > line 1137 in save_cell > File "/usr/lib/python3.7/pickle.py", line 504 in save > File "/usr/lib/python3.7/pickle.py", line 771 in save_tuple > File "/usr/lib/python3.7/pickle.py", line 504 in save > File "/usr/lib/python3.7/pickle.py", line 786 in save_tuple > File "/usr/lib/python3.7/pickle.py", line 504 in save > File 
"/usr/lib/python3.7/pickle.py", line 638 in save_reduce > File > "/usr/local/google/home/valentyn/tmp/py37env/lib/python3.7/site-packages/dill/_dill.py", > line 1394 in save_function > File "/usr/lib/python3.7/pickle.py", line 504 in save > File "/usr/lib/python3.7/pickle.py", line 882 in _batch_setitems > File "/usr/lib/python3.7/pickle.py", line 856 in save_dict > File > "/usr/local/google/home/valentyn/tmp/py37env/lib/python3.7/site-packages/dill/_dill.py", > line 910 in save_module_dict > File > "/usr/local/google/home/valentyn/projects/beam/clean/beam/sdks/python/apache_beam/internal/pickler.py", > line 198 in new_save_module_dict > ... > {noformat} > cc: [~yoshiki.obata] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-8397) DataflowRunnerTest.test_remote_runner_display_data fails due to infinite recursion during pickling.
[ https://issues.apache.org/jira/browse/BEAM-8397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16958417#comment-16958417 ] Valentyn Tymofieiev commented on BEAM-8397: --- [~udim] pointed out that the "asubcomponent" entry is associated with an instance of HasDisplayData, so instead of adding an "asubcomponent" entry, we accumulate the associated display_data values as per [1], and thus populate the dofn_value key. So the scenario of the existing test makes sense. However, when we move SpecialDoFn and SpecialParDo out of the test_remote_runner_display_data method, the display_data suddenly becomes: {noformat} [{'key': 'dofn_value', 'namespace': 'apache_beam.runners.dataflow.dataflow_runner_test.SpecialDoFn', 'type': 'INTEGER', 'value': 42}, {'key': 'fn', 'namespace': 'apache_beam.transforms.core.ParDo', 'shortValue': 'SpecialDoFn', 'label': 'Transform Function', 'type': 'STRING', 'value': 'apache_beam.runners.dataflow.dataflow_runner_test.SpecialDoFn'}] {noformat} while the expected display_data is {noformat} [{'key': 'a_time', 'namespace': 'apache_beam.runners.dataflow.dataflow_runner_test.SpecialParDo', 'type': 'TIMESTAMP', 'value': 1571829781919}, {'key': 'a_class', 'shortValue': 'SpecialParDo', 'namespace': 'apache_beam.runners.dataflow.dataflow_runner_test.SpecialParDo', 'type': 'STRING', 'value': 'apache_beam.runners.dataflow.dataflow_runner_test.SpecialParDo'}, {'key': 'dofn_value', 'namespace': 'apache_beam.runners.dataflow.dataflow_runner_test.SpecialDoFn', 'type': 'INTEGER', 'value': 42}] {noformat} It appears that the display_data method of SpecialParDo is no longer called; instead, display_data is populated from the display_data of the ParDo class (note that the namespace is transforms.core.ParDo). The root cause of this behavior is unclear so far. 
[1] [https://github.com/apache/beam/blob/8e574a177794aa04345cf8cd0c2ce1717ee40ec6/sdks/python/apache_beam/transforms/display.py#L100]
[jira] [Updated] (BEAM-8466) Python typehints: pep 484 warn and strict modes
[ https://issues.apache.org/jira/browse/BEAM-8466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Udi Meiri updated BEAM-8466: Status: Open (was: Triage Needed) > Python typehints: pep 484 warn and strict modes > --- > > Key: BEAM-8466 > URL: https://issues.apache.org/jira/browse/BEAM-8466 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Reporter: Udi Meiri >Assignee: Udi Meiri >Priority: Major > > Allow type checking to use PEP 484 type hints, but only warn if there are > errors, and in another mode to raise exceptions on errors. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-8397) DataflowRunnerTest.test_remote_runner_display_data fails due to infinite recursion during pickling.
[ https://issues.apache.org/jira/browse/BEAM-8397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16958414#comment-16958414 ] Udi Meiri commented on BEAM-8397: --- 1. The "asubcomponent" entry is replaced by self.fn.display_data(). See this code: https://github.com/apache/beam/blob/8e574a177794aa04345cf8cd0c2ce1717ee40ec6/sdks/python/apache_beam/transforms/display.py#L99-L103 2. Beam's pickler.py tries to handle nested classes, but it can only do so if the class is nested within another class. In this case the class is nested inside a method. This code shows that it only recurses into classes: https://github.com/apache/beam/blob/a757475e2017bbc92871e9d3507843fc0e1a9611/sdks/python/apache_beam/internal/pickler.py#L75 What happens in this test case is still a mystery to me. I have no idea why it works in some cases but not in others. Also, if we move SpecialParDo to be module-level, its display_data() is never called (ParDo.display_data is called instead). [~robertwb] do you remember the intention for this test? Do users successfully subclass ParDo and override display_data in their pipelines? 
[jira] [Work logged] (BEAM-8456) BigQuery to Beam SQL timestamp has the wrong default: truncation makes the most sense
[ https://issues.apache.org/jira/browse/BEAM-8456?focusedWorklogId=332965=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332965 ] ASF GitHub Bot logged work on BEAM-8456: Author: ASF GitHub Bot Created on: 23/Oct/19 23:21 Start Date: 23/Oct/19 23:21 Worklog Time Spent: 10m Work Description: apilloud commented on pull request #9849: [BEAM-8456] Add pipeline option to have Data Catalog truncate sub-millisecond precision URL: https://github.com/apache/beam/pull/9849#discussion_r338318184 ## File path: sdks/java/extensions/sql/datacatalog/src/main/java/org/apache/beam/sdk/extensions/sql/meta/provider/datacatalog/DataCatalogTableProvider.java ## @@ -138,8 +143,41 @@ private Table loadTableFromDC(String tableName) { } } - @Override - public BeamSqlTable buildBeamSqlTable(Table table) { -return delegateProviders.get(table.getType()).buildBeamSqlTable(table); + private static DataCatalogBlockingStub createDataCatalogClient( + DataCatalogPipelineOptions options) { +return DataCatalogGrpc.newBlockingStub( + ManagedChannelBuilder.forTarget(options.getDataCatalogEndpoint()).build()) +.withCallCredentials( + MoreCallCredentials.from(options.as(GcpOptions.class).getGcpCredential())); + } + + private static Map getSupportedProviders() { +return Stream.of( +new PubsubJsonTableProvider(), new BigQueryTableProvider(), new TextTableProvider()) +.collect(toMap(TableProvider::getTableType, p -> p)); + } + + private Table toCalciteTable(String tableName, Entry entry) { +if (entry.getSchema().getColumnsCount() == 0) { + throw new UnsupportedOperationException( + "Entry doesn't have a schema. Please attach a schema to '" + + tableName + + "' in Data Catalog: " + + entry.toString()); +} +Schema schema = SchemaUtils.fromDataCatalog(entry.getSchema()); + +Optional tableBuilder = tableFactory.tableBuilder(entry); +if (tableBuilder.isPresent()) { + return tableBuilder.get().schema(schema).name(tableName).build(); +} else { Review comment: nit: unneeded `else`. 
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 332965) Time Spent: 1h 50m (was: 1h 40m) > BigQuery to Beam SQL timestamp has the wrong default: truncation makes the > most sense > - > > Key: BEAM-8456 > URL: https://issues.apache.org/jira/browse/BEAM-8456 > Project: Beam > Issue Type: Improvement > Components: dsl-sql >Reporter: Kenneth Knowles >Assignee: Kenneth Knowles >Priority: Major > Time Spent: 1h 50m > Remaining Estimate: 0h > > Most of the time, a user reading a timestamp from BigQuery with > higher-than-millisecond precision timestamps may not even realize that the > data source created these high precision timestamps. They are probably > timestamps on log entries generated by a system with higher precision. > If they are using it with Beam SQL, which only supports millisecond > precision, it makes sense to "just work" by default. -- This message was sent by Atlassian Jira (v8.3.4#803005)
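The default the issue argues for, making higher-precision BigQuery timestamps "just work" by truncating them to the millisecond precision Beam SQL supports, boils down to something like the following standalone sketch (illustrative only, not the actual Beam code path):

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

class TruncateSketch {
  // Drop sub-millisecond precision: 2019-10-23T23:21:00.123456789Z
  // becomes 2019-10-23T23:21:00.123Z, which fits Beam SQL's
  // millisecond-precision TIMESTAMP.
  static Instant truncateToMillis(Instant t) {
    return t.truncatedTo(ChronoUnit.MILLIS);
  }
}
```

A user whose log-entry timestamps happen to carry microseconds would never notice the truncation, which is the "just work" behavior the issue describes.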
[jira] [Work logged] (BEAM-8456) BigQuery to Beam SQL timestamp has the wrong default: truncation makes the most sense
[ https://issues.apache.org/jira/browse/BEAM-8456?focusedWorklogId=332959=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332959 ] ASF GitHub Bot logged work on BEAM-8456: Author: ASF GitHub Bot Created on: 23/Oct/19 23:21 Start Date: 23/Oct/19 23:21 Worklog Time Spent: 10m Work Description: apilloud commented on pull request #9849: [BEAM-8456] Add pipeline option to have Data Catalog truncate sub-millisecond precision URL: https://github.com/apache/beam/pull/9849#discussion_r338310639 ## File path: sdks/java/extensions/sql/datacatalog/src/main/java/org/apache/beam/sdk/extensions/sql/meta/provider/datacatalog/DataCatalogTableProvider.java ## @@ -138,8 +143,41 @@ private Table loadTableFromDC(String tableName) { } } - @Override - public BeamSqlTable buildBeamSqlTable(Table table) { -return delegateProviders.get(table.getType()).buildBeamSqlTable(table); + private static DataCatalogBlockingStub createDataCatalogClient( + DataCatalogPipelineOptions options) { +return DataCatalogGrpc.newBlockingStub( + ManagedChannelBuilder.forTarget(options.getDataCatalogEndpoint()).build()) +.withCallCredentials( + MoreCallCredentials.from(options.as(GcpOptions.class).getGcpCredential())); + } + + private static Map getSupportedProviders() { +return Stream.of( +new PubsubJsonTableProvider(), new BigQueryTableProvider(), new TextTableProvider()) Review comment: nit: This could be inlined or even turned into a constant. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 332959) Time Spent: 1h 20m (was: 1h 10m) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8456) BigQuery to Beam SQL timestamp has the wrong default: truncation makes the most sense
[ https://issues.apache.org/jira/browse/BEAM-8456?focusedWorklogId=332961=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332961 ] ASF GitHub Bot logged work on BEAM-8456: Author: ASF GitHub Bot Created on: 23/Oct/19 23:21 Start Date: 23/Oct/19 23:21 Worklog Time Spent: 10m Work Description: apilloud commented on pull request #9849: [BEAM-8456] Add pipeline option to have Data Catalog truncate sub-millisecond precision URL: https://github.com/apache/beam/pull/9849#discussion_r338321189 ## File path: sdks/java/extensions/sql/datacatalog/src/main/java/org/apache/beam/sdk/extensions/sql/meta/provider/datacatalog/DataCatalogTableProvider.java ## @@ -138,8 +143,41 @@ private Table loadTableFromDC(String tableName) { } } - @Override - public BeamSqlTable buildBeamSqlTable(Table table) { -return delegateProviders.get(table.getType()).buildBeamSqlTable(table); + private static DataCatalogBlockingStub createDataCatalogClient( + DataCatalogPipelineOptions options) { +return DataCatalogGrpc.newBlockingStub( + ManagedChannelBuilder.forTarget(options.getDataCatalogEndpoint()).build()) +.withCallCredentials( + MoreCallCredentials.from(options.as(GcpOptions.class).getGcpCredential())); + } + + private static Map getSupportedProviders() { +return Stream.of( +new PubsubJsonTableProvider(), new BigQueryTableProvider(), new TextTableProvider()) +.collect(toMap(TableProvider::getTableType, p -> p)); + } + + private Table toCalciteTable(String tableName, Entry entry) { +if (entry.getSchema().getColumnsCount() == 0) { + throw new UnsupportedOperationException( + "Entry doesn't have a schema. 
Please attach a schema to '" + + tableName + + "' in Data Catalog: " + + entry.toString()); +} +Schema schema = SchemaUtils.fromDataCatalog(entry.getSchema()); + +Optional tableBuilder = tableFactory.tableBuilder(entry); +if (tableBuilder.isPresent()) { + return tableBuilder.get().schema(schema).name(tableName).build(); Review comment: nit: This pattern gives you all the pain of `Optional` with none of the benefits. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 332961) Time Spent: 1h 40m (was: 1.5h) -- This message was sent by Atlassian Jira (v8.3.4#803005)
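The "all the pain of `Optional` with none of the benefits" nit above refers to the isPresent()/get() pair, which is just a null check in disguise; a map/orElseGet chain expresses "transform if present, otherwise fall back" in one expression. A generic sketch with made-up types (not the Beam code):

```java
import java.util.Optional;

class OptionalStyleSketch {
  // The isPresent()/get() shape the reviewer is flagging:
  static String withIsPresent(Optional<String> builder) {
    if (builder.isPresent()) {
      return "built:" + builder.get();
    } else {
      return "delegated";
    }
  }

  // Equivalent map/orElseGet chain, with no explicit presence check:
  static String withMap(Optional<String> builder) {
    return builder.map(b -> "built:" + b).orElseGet(() -> "delegated");
  }
}
```

In the reviewed code, the fallback branch (delegating to another provider) would go in the orElseGet supplier.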
[jira] [Work logged] (BEAM-8456) BigQuery to Beam SQL timestamp has the wrong default: truncation makes the most sense
[ https://issues.apache.org/jira/browse/BEAM-8456?focusedWorklogId=332963=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332963 ] ASF GitHub Bot logged work on BEAM-8456: Author: ASF GitHub Bot Created on: 23/Oct/19 23:21 Start Date: 23/Oct/19 23:21 Worklog Time Spent: 10m Work Description: apilloud commented on pull request #9849: [BEAM-8456] Add pipeline option to have Data Catalog truncate sub-millisecond precision URL: https://github.com/apache/beam/pull/9849#discussion_r338317396 ## File path: sdks/java/extensions/sql/datacatalog/src/main/java/org/apache/beam/sdk/extensions/sql/meta/provider/datacatalog/TableFactory.java ## @@ -0,0 +1,26 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.beam.sdk.extensions.sql.meta.provider.datacatalog; + +import com.google.cloud.datacatalog.Entry; +import java.util.Optional; +import org.apache.beam.sdk.extensions.sql.meta.Table.Builder; + +interface TableFactory { + Optional tableBuilder(Entry entry); Review comment: It is probably worth documenting this interface. Particularly that the `TableFactory` should check to see if the `Entry` belongs to it and return `Optional.empty()` if it doesn't. 
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 332963) Time Spent: 1h 40m (was: 1.5h) -- This message was sent by Atlassian Jira (v8.3.4#803005)
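The interface documentation requested above, combined with the early-bail-out shape suggested in the following review comment, could look roughly like this. String stands in for the real Entry and Table.Builder types, so this is a shape sketch only:

```java
import java.util.Optional;

// Shape sketch only: String stands in for Data Catalog's Entry and for
// Table.Builder in the real Beam code.
interface TableFactorySketch {
  /**
   * Returns a partially-populated table builder if this factory recognizes
   * the entry's linked resource, or {@code Optional.empty()} if the entry
   * belongs to a different factory.
   */
  Optional<String> tableBuilder(String linkedResource);
}

class PubsubFactorySketch implements TableFactorySketch {
  @Override
  public Optional<String> tableBuilder(String linkedResource) {
    // Bail out early instead of wrapping the real work in an else branch.
    if (!linkedResource.contains("pubsub.googleapis.com")) {
      return Optional.empty();
    }
    return Optional.of("pubsub");
  }
}
```

The caller can then try each factory in turn and use the first non-empty result, which is what makes the empty-on-mismatch contract worth documenting.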
[jira] [Work logged] (BEAM-8456) BigQuery to Beam SQL timestamp has the wrong default: truncation makes the most sense
[ https://issues.apache.org/jira/browse/BEAM-8456?focusedWorklogId=332960=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332960 ] ASF GitHub Bot logged work on BEAM-8456: Author: ASF GitHub Bot Created on: 23/Oct/19 23:21 Start Date: 23/Oct/19 23:21 Worklog Time Spent: 10m Work Description: apilloud commented on pull request #9849: [BEAM-8456] Add pipeline option to have Data Catalog truncate sub-millisecond precision URL: https://github.com/apache/beam/pull/9849#discussion_r338307365 ## File path: sdks/java/extensions/sql/datacatalog/src/main/java/org/apache/beam/sdk/extensions/sql/meta/provider/datacatalog/BigQueryTableFactory.java ## @@ -20,23 +20,39 @@ import com.alibaba.fastjson.JSONObject; import com.google.cloud.datacatalog.Entry; import java.net.URI; +import java.util.Optional; import java.util.regex.Matcher; import java.util.regex.Pattern; import org.apache.beam.sdk.extensions.sql.meta.Table; +import org.apache.beam.sdk.extensions.sql.meta.Table.Builder; +import org.apache.beam.vendor.guava.v26_0_jre.com.google.common.collect.ImmutableMap; -/** Utils to extract BQ-specific entry information. */ -class BigQueryUtils { +/** {@link TableFactory} that understands Data Catalog BigQuery entries. 
*/ +class BigQueryTableFactory implements TableFactory { + private static final String BIGQUERY_API = "bigquery.googleapis.com"; private static final Pattern BQ_PATH_PATTERN = Pattern.compile( "/projects/(?[^/]+)/datasets/(?[^/]+)/tables/(?[^/]+)"); - static Table.Builder tableBuilder(Entry entry) { -return Table.builder() -.location(getLocation(entry)) -.properties(new JSONObject()) -.type("bigquery") -.comment(""); + private final boolean truncateTimestamps; + + public BigQueryTableFactory(boolean truncateTimestamps) { +this.truncateTimestamps = truncateTimestamps; + } + + @Override + public Optional tableBuilder(Entry entry) { +if (URI.create(entry.getLinkedResource()).getAuthority().toLowerCase().equals(BIGQUERY_API)) { + return Optional.of( + Table.builder() + .location(getLocation(entry)) + .properties(new JSONObject(ImmutableMap.of("truncateTimestamps", truncateTimestamps))) + .type("bigquery") + .comment("")); +} else { Review comment: nit: Unneeded else. This might be clearer if the `if` bailed out. For example: ``` if (!URI.equals()) { return Optional.empty(); } // Actual work ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 332960) Time Spent: 1.5h (was: 1h 20m) > BigQuery to Beam SQL timestamp has the wrong default: truncation makes the > most sense > - > > Key: BEAM-8456 > URL: https://issues.apache.org/jira/browse/BEAM-8456 > Project: Beam > Issue Type: Improvement > Components: dsl-sql >Reporter: Kenneth Knowles >Assignee: Kenneth Knowles >Priority: Major > Time Spent: 1.5h > Remaining Estimate: 0h > > Most of the time, a user reading a timestamp from BigQuery with > higher-than-millisecond precision timestamps may not even realize that the > data source created these high precision timestamps. 
They are probably > timestamps on log entries generated by a system with higher precision. > If they are using it with Beam SQL, which only supports millisecond > precision, it makes sense to "just work" by default. -- This message was sent by Atlassian Jira (v8.3.4#803005)
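The "unneeded else" nit in the review comment above is the guard-clause pattern. A minimal, self-contained sketch of that shape (the `Entry` and `tableType` names here are illustrative stand-ins, not Beam's actual `Entry`/`TableFactory` types):

```java
import java.util.Optional;

public class GuardClauseSketch {
    private static final String BIGQUERY_API = "bigquery.googleapis.com";

    // Stand-in for a Data Catalog Entry; only the service authority matters here.
    static class Entry {
        final String authority;
        Entry(String authority) { this.authority = authority; }
    }

    // The reviewer's suggestion: bail out first, then do the actual work
    // with no else branch.
    static Optional<String> tableType(Entry entry) {
        if (!entry.authority.equalsIgnoreCase(BIGQUERY_API)) {
            return Optional.empty();
        }
        // Actual work: build the table description.
        return Optional.of("bigquery");
    }

    public static void main(String[] args) {
        System.out.println(tableType(new Entry("bigquery.googleapis.com"))); // Optional[bigquery]
        System.out.println(tableType(new Entry("pubsub.googleapis.com")));   // Optional.empty
    }
}
```

The early return keeps the non-BigQuery case out of the way, so the main construction path reads linearly.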
[jira] [Work logged] (BEAM-8456) BigQuery to Beam SQL timestamp has the wrong default: truncation makes the most sense
[ https://issues.apache.org/jira/browse/BEAM-8456?focusedWorklogId=332958=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332958 ] ASF GitHub Bot logged work on BEAM-8456: Author: ASF GitHub Bot Created on: 23/Oct/19 23:21 Start Date: 23/Oct/19 23:21 Worklog Time Spent: 10m Work Description: apilloud commented on pull request #9849: [BEAM-8456] Add pipeline option to have Data Catalog truncate sub-millisecond precision URL: https://github.com/apache/beam/pull/9849#discussion_r338317511 ## File path: sdks/java/extensions/sql/datacatalog/src/main/java/org/apache/beam/sdk/extensions/sql/meta/provider/datacatalog/PubsubTableFactory.java ## @@ -20,22 +20,32 @@ import com.alibaba.fastjson.JSONObject; import com.google.cloud.datacatalog.Entry; import java.net.URI; +import java.util.Optional; import java.util.regex.Matcher; import java.util.regex.Pattern; import org.apache.beam.sdk.extensions.sql.meta.Table; +import org.apache.beam.sdk.extensions.sql.meta.Table.Builder; -/** Utils to extract Pubsub-specific entry information. */ -class PubsubUtils { +/** {@link TableFactory} that understands Data Catalog Pubsub entries. */ +class PubsubTableFactory implements TableFactory { + + private static final String PUBSUB_API = "pubsub.googleapis.com"; private static final Pattern PS_PATH_PATTERN = Pattern.compile("/projects/(?[^/]+)/topics/(?[^/]+)"); - static Table.Builder tableBuilder(Entry entry) { -return Table.builder() -.location(getLocation(entry)) -.properties(new JSONObject()) -.type("pubsub") -.comment(""); + @Override + public Optional tableBuilder(Entry entry) { +if (URI.create(entry.getLinkedResource()).getAuthority().toLowerCase().equals(PUBSUB_API)) { + return Optional.of( + Table.builder() + .location(getLocation(entry)) + .properties(new JSONObject()) + .type("pubsub") + .comment("")); +} else { Review comment: nit: Unneeded else. This might be clearer if the `if` bailed out. This is an automated message from the Apache Git Service. 
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 332958) Time Spent: 1h 10m (was: 1h) > BigQuery to Beam SQL timestamp has the wrong default: truncation makes the > most sense > - > > Key: BEAM-8456 > URL: https://issues.apache.org/jira/browse/BEAM-8456 > Project: Beam > Issue Type: Improvement > Components: dsl-sql >Reporter: Kenneth Knowles >Assignee: Kenneth Knowles >Priority: Major > Time Spent: 1h 10m > Remaining Estimate: 0h > > Most of the time, a user reading a timestamp from BigQuery with > higher-than-millisecond precision timestamps may not even realize that the > data source created these high precision timestamps. They are probably > timestamps on log entries generated by a system with higher precision. > If they are using it with Beam SQL, which only supports millisecond > precision, it makes sense to "just work" by default. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8456) BigQuery to Beam SQL timestamp has the wrong default: truncation makes the most sense
[ https://issues.apache.org/jira/browse/BEAM-8456?focusedWorklogId=332964=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332964 ] ASF GitHub Bot logged work on BEAM-8456: Author: ASF GitHub Bot Created on: 23/Oct/19 23:21 Start Date: 23/Oct/19 23:21 Worklog Time Spent: 10m Work Description: apilloud commented on pull request #9849: [BEAM-8456] Add pipeline option to have Data Catalog truncate sub-millisecond precision URL: https://github.com/apache/beam/pull/9849#discussion_r338308066 ## File path: sdks/java/extensions/sql/datacatalog/src/main/java/org/apache/beam/sdk/extensions/sql/meta/provider/datacatalog/DataCatalogPipelineOptions.java ## @@ -32,4 +32,12 @@ String getDataCatalogEndpoint(); void setDataCatalogEndpoint(String dataCatalogEndpoint); + + /** Whether to truncate timestamps in tables described by Data Catalog. */ + @Description("Truncate sub-millisecond precision timestamps in tables described by Data Catalog") + @Validation.Required + @Default.Boolean(false) + Boolean getTruncateTimestamps(); Review comment: Nit: `boolean` should work here and below. (is this a limitation of PipelineOptions)? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 332964) Time Spent: 1h 50m (was: 1h 40m) > BigQuery to Beam SQL timestamp has the wrong default: truncation makes the > most sense > - > > Key: BEAM-8456 > URL: https://issues.apache.org/jira/browse/BEAM-8456 > Project: Beam > Issue Type: Improvement > Components: dsl-sql >Reporter: Kenneth Knowles >Assignee: Kenneth Knowles >Priority: Major > Time Spent: 1h 50m > Remaining Estimate: 0h > > Most of the time, a user reading a timestamp from BigQuery with > higher-than-millisecond precision timestamps may not even realize that the > data source created these high precision timestamps. They are probably > timestamps on log entries generated by a system with higher precision. > If they are using it with Beam SQL, which only supports millisecond > precision, it makes sense to "just work" by default. -- This message was sent by Atlassian Jira (v8.3.4#803005)
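On the `Boolean` vs `boolean` nit above: one plausible reason for the boxed type (an assumption, not confirmed in the thread) is that proxy-backed options interfaces need a `null` state to mean "not set yet" before a `@Default` is applied, which a primitive cannot express. A generic illustration of that distinction, not Beam code:

```java
public class BoxedDefaultDemo {
    // A boxed Boolean field starts out null ("never set"), while a
    // primitive boolean starts out false, which is indistinguishable
    // from an explicit false supplied by the user.
    static Boolean boxedFlag;
    static boolean primitiveFlag;

    public static void main(String[] args) {
        System.out.println(boxedFlag == null); // prints: true
        System.out.println(primitiveFlag);     // prints: false
    }
}
```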
[jira] [Work logged] (BEAM-8456) BigQuery to Beam SQL timestamp has the wrong default: truncation makes the most sense
[ https://issues.apache.org/jira/browse/BEAM-8456?focusedWorklogId=332962=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332962 ] ASF GitHub Bot logged work on BEAM-8456: Author: ASF GitHub Bot Created on: 23/Oct/19 23:21 Start Date: 23/Oct/19 23:21 Worklog Time Spent: 10m Work Description: apilloud commented on pull request #9849: [BEAM-8456] Add pipeline option to have Data Catalog truncate sub-millisecond precision URL: https://github.com/apache/beam/pull/9849#discussion_r338308594 ## File path: sdks/java/extensions/sql/datacatalog/src/main/java/org/apache/beam/sdk/extensions/sql/meta/provider/datacatalog/DataCatalogPipelineOptions.java ## @@ -32,4 +32,12 @@ String getDataCatalogEndpoint(); void setDataCatalogEndpoint(String dataCatalogEndpoint); + + /** Whether to truncate timestamps in tables described by Data Catalog. */ Review comment: Why is this applied at the Data Catalog level and not the Beam SQL level? It seems this is something every table provider should support. (It would be easy to expand later, I'm fine leaving it for now.) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 332962) Time Spent: 1h 40m (was: 1.5h) > BigQuery to Beam SQL timestamp has the wrong default: truncation makes the > most sense > - > > Key: BEAM-8456 > URL: https://issues.apache.org/jira/browse/BEAM-8456 > Project: Beam > Issue Type: Improvement > Components: dsl-sql >Reporter: Kenneth Knowles >Assignee: Kenneth Knowles >Priority: Major > Time Spent: 1h 40m > Remaining Estimate: 0h > > Most of the time, a user reading a timestamp from BigQuery with > higher-than-millisecond precision timestamps may not even realize that the > data source created these high precision timestamps. 
They are probably > timestamps on log entries generated by a system with higher precision. > If they are using it with Beam SQL, which only supports millisecond > precision, it makes sense to "just work" by default. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-8368) [Python] libprotobuf-generated exception when importing apache_beam
[ https://issues.apache.org/jira/browse/BEAM-8368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brian Hulette updated BEAM-8368: Fix Version/s: (was: 2.17.0) 2.18.0 > [Python] libprotobuf-generated exception when importing apache_beam > --- > > Key: BEAM-8368 > URL: https://issues.apache.org/jira/browse/BEAM-8368 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Affects Versions: 2.15.0, 2.17.0 >Reporter: Ubaier Bhat >Assignee: Brian Hulette >Priority: Blocker > Fix For: 2.18.0 > > Attachments: error_log.txt > > Time Spent: 1h 50m > Remaining Estimate: 0h > > Unable to import apache_beam after upgrading to macos 10.15 (Catalina). > Cleared all the pipenvs and but can't get it working again. > {code} > import apache_beam as beam > /Users/***/.local/share/virtualenvs/beam-etl-ims6DitU/lib/python3.7/site-packages/apache_beam/__init__.py:84: > UserWarning: Some syntactic constructs of Python 3 are not yet fully > supported by Apache Beam. > 'Some syntactic constructs of Python 3 are not yet fully supported by ' > [libprotobuf ERROR google/protobuf/descriptor_database.cc:58] File already > exists in database: > [libprotobuf FATAL google/protobuf/descriptor.cc:1370] CHECK failed: > GeneratedDatabase()->Add(encoded_file_descriptor, size): > libc++abi.dylib: terminating with uncaught exception of type > google::protobuf::FatalException: CHECK failed: > GeneratedDatabase()->Add(encoded_file_descriptor, size): > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-7886) Make row coder a standard coder and implement in python
[ https://issues.apache.org/jira/browse/BEAM-7886?focusedWorklogId=332945=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332945 ] ASF GitHub Bot logged work on BEAM-7886: Author: ASF GitHub Bot Created on: 23/Oct/19 22:36 Start Date: 23/Oct/19 22:36 Worklog Time Spent: 10m Work Description: TheNeuralBit commented on issue #9188: [BEAM-7886] Make row coder a standard coder and implement in Python URL: https://github.com/apache/beam/pull/9188#issuecomment-545664933 Run Python PreCommit This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 332945) Time Spent: 14h (was: 13h 50m) > Make row coder a standard coder and implement in python > --- > > Key: BEAM-7886 > URL: https://issues.apache.org/jira/browse/BEAM-7886 > Project: Beam > Issue Type: Improvement > Components: beam-model, sdk-java-core, sdk-py-core >Reporter: Brian Hulette >Assignee: Brian Hulette >Priority: Major > Time Spent: 14h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8451) Interactive Beam example failing from stack overflow
[ https://issues.apache.org/jira/browse/BEAM-8451?focusedWorklogId=332941=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332941 ] ASF GitHub Bot logged work on BEAM-8451: Author: ASF GitHub Bot Created on: 23/Oct/19 22:32 Start Date: 23/Oct/19 22:32 Worklog Time Spent: 10m Work Description: chunyang commented on pull request #9865: [BEAM-8451] Fix interactive beam max recursion err URL: https://github.com/apache/beam/pull/9865 When a pipeline contains a PTransform which passes through its input PCollection during `expand()`, the `PipelineInfo` object may run into an infinite recursion error while constructing the derivation for said PCollection. This PR prevents the infinite recursion by not marking a transform as a producer of a PCollection if that PCollection is also an input to the transform. I've added a test for this, but without the fix, the test does not fail 100% of the time--it's dependent on the ordering of transforms in the Pipeline proto. Any advice on how to test this better is appreciated. Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily: - [x] [**Choose reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and mention them in a comment (`R: @username`). R: @aaltay R: @robertwb R: @charlesccychen - [x] Format the pull request title like `[BEAM-XXX] Fixes bug in ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue. - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf). See the [Contributor Guide](https://beam.apache.org/contribute) for more tips on [how to make review process smoother](https://beam.apache.org/contribute/#make-reviewers-job-easier).
Post-Commit Tests Status (on master branch): [Jenkins build-status badge table from the PR template for the Go, Java, and Python SDKs across the Apex, Dataflow, Flink, Gearpump, Samza, and Spark runners omitted; truncated in the archive]
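The fix described in the PR above, not recording a transform as the producer of a PCollection that is also one of its inputs, can be sketched as follows. The `Transform` and `producers` names are hypothetical; this is not the actual `PipelineInfo` API:

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class ProducerMapSketch {
    static class Transform {
        final String name;
        final Set<String> inputs;
        final Set<String> outputs;
        Transform(String name, Set<String> inputs, Set<String> outputs) {
            this.name = name;
            this.inputs = inputs;
            this.outputs = outputs;
        }
    }

    // Map each PCollection id to its producing transform, skipping
    // pass-through outputs: if an output is also an input of the same
    // transform, recording that transform as the producer would make the
    // derivation chain revisit the same PCollection forever.
    static Map<String, String> producers(List<Transform> transforms) {
        Map<String, String> result = new TreeMap<>();
        for (Transform t : transforms) {
            for (String out : t.outputs) {
                if (!t.inputs.contains(out)) {
                    result.put(out, t.name);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Transform> ts = List.of(
            new Transform("Create", Set.of(), Set.of("pc1")),
            new Transform("PassThrough", Set.of("pc1"), Set.of("pc1")),
            new Transform("Map", Set.of("pc1"), Set.of("pc2")));
        System.out.println(producers(ts)); // prints: {pc1=Create, pc2=Map}
    }
}
```

With the guard, `pc1` keeps `Create` as its producer even though `PassThrough` re-emits it, so walking producers backwards terminates.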
[jira] [Work logged] (BEAM-8428) [SQL] BigQuery should support project push-down in DIRECT_READ mode
[ https://issues.apache.org/jira/browse/BEAM-8428?focusedWorklogId=332936=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332936 ] ASF GitHub Bot logged work on BEAM-8428: Author: ASF GitHub Bot Created on: 23/Oct/19 22:18 Start Date: 23/Oct/19 22:18 Worklog Time Spent: 10m Work Description: 11moon11 commented on pull request #9864: [BEAM-8428] [SQL] [WIP] BigQuery should return row fields in the selected order URL: https://github.com/apache/beam/pull/9864 This problem occurs when `BeamIOPushDownRule` drops the Calc. BigQueryIO by default projects fields in the order they are present in the schema, but the expected behavior is to project fields in the order they are selected in the SQL statement. Also, `supportsProjects` should only support project push-down in `DIRECT_READ` mode. Based on top of #9863. Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily: - [ ] [**Choose reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and mention them in a comment (`R: @username`). - [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue. - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf). See the [Contributor Guide](https://beam.apache.org/contribute) for more tips on [how to make review process smoother](https://beam.apache.org/contribute/#make-reviewers-job-easier).
[Jenkins build-status badge table from the PR template for the Go, Java, and Python SDKs across the Apex, Dataflow, Flink, Gearpump, Samza, and Spark runners omitted; truncated in the archive]
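The field-order mismatch described in the PR above amounts to permuting each row from schema order into select-list order. A hypothetical sketch (the `reorder` helper and untyped `Object` rows are illustrative, not BigQueryIO's API):

```java
import java.util.ArrayList;
import java.util.List;

public class FieldReorderSketch {
    // Reorder a row's values from the source's schema order into the
    // order in which the fields were selected in the SQL statement.
    static List<Object> reorder(
            List<String> schemaOrder, List<String> selectOrder, List<Object> row) {
        List<Object> out = new ArrayList<>(selectOrder.size());
        for (String field : selectOrder) {
            // Look up where the selected field sits in the schema and
            // pull the value from that position.
            out.add(row.get(schemaOrder.indexOf(field)));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> schema = List.of("id", "name", "value");
        List<String> selected = List.of("value", "id");
        // A row arrives in schema order; emit it in select order.
        System.out.println(reorder(schema, selected, List.of(7, "a", 3.5)));
        // prints: [3.5, 7]
    }
}
```

Without such a permutation, dropping the Calc leaves downstream consumers reading fields at the wrong positions.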
[jira] [Commented] (BEAM-8368) [Python] libprotobuf-generated exception when importing apache_beam
[ https://issues.apache.org/jira/browse/BEAM-8368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958282#comment-16958282 ] Ahmet Altay commented on BEAM-8368: --- Let's move to 2.18 and keep watching the pyarrow update. (For reference arrow thread is [https://lists.apache.org/thread.html/462464941d79e67bfef7d73fc900ea5fe0200e8a926253c6eb0285aa@%3Cdev.arrow.apache.org%3E]) If there will be Beam 2.17 RC2 maybe we can reconsider at that time. > [Python] libprotobuf-generated exception when importing apache_beam > --- > > Key: BEAM-8368 > URL: https://issues.apache.org/jira/browse/BEAM-8368 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Affects Versions: 2.15.0, 2.17.0 >Reporter: Ubaier Bhat >Assignee: Brian Hulette >Priority: Blocker > Fix For: 2.17.0 > > Attachments: error_log.txt > > Time Spent: 1h 50m > Remaining Estimate: 0h > > Unable to import apache_beam after upgrading to macos 10.15 (Catalina). > Cleared all the pipenvs and but can't get it working again. > {code} > import apache_beam as beam > /Users/***/.local/share/virtualenvs/beam-etl-ims6DitU/lib/python3.7/site-packages/apache_beam/__init__.py:84: > UserWarning: Some syntactic constructs of Python 3 are not yet fully > supported by Apache Beam. > 'Some syntactic constructs of Python 3 are not yet fully supported by ' > [libprotobuf ERROR google/protobuf/descriptor_database.cc:58] File already > exists in database: > [libprotobuf FATAL google/protobuf/descriptor.cc:1370] CHECK failed: > GeneratedDatabase()->Add(encoded_file_descriptor, size): > libc++abi.dylib: terminating with uncaught exception of type > google::protobuf::FatalException: CHECK failed: > GeneratedDatabase()->Add(encoded_file_descriptor, size): > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-7926) Visualize PCollection with Interactive Beam
[ https://issues.apache.org/jira/browse/BEAM-7926?focusedWorklogId=332920=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332920 ] ASF GitHub Bot logged work on BEAM-7926: Author: ASF GitHub Bot Created on: 23/Oct/19 21:50 Start Date: 23/Oct/19 21:50 Worklog Time Spent: 10m Work Description: KevinGG commented on issue #9741: [BEAM-7926] Visualize PCollection URL: https://github.com/apache/beam/pull/9741#issuecomment-545650837 I've added two tox envs and corresponding Py36 and Py37 test suites. Removed [interactive] from test required packages. Added it to the py2:docs env. I'll add [interactive] for other specific envs defined in tox.ini if any of the gradle tasks need it. I've also added checks and warnings in `interactive_environment.InteractiveEnvironment` to check prerequisites for interactive beam (didn't put them in the global scope because global scoped warning level logging will fail gradle tasks in pre-commit): 1. If Python is above 3.5.3; 2. If [interactive] dependencies are fully available; 3. If the current runtime is within an interactive environment, i.e., within an ipython kernel. For the `time.sleep()` and `pcoll_visualization` NOOP strategy when dependencies are not fully installed, please let me know your preference, @aaltay. Personally I'm fine with either approach, and I'll make changes if needed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 332920) Time Spent: 13h 50m (was: 13h 40m) > Visualize PCollection with Interactive Beam > --- > > Key: BEAM-7926 > URL: https://issues.apache.org/jira/browse/BEAM-7926 > Project: Beam > Issue Type: New Feature > Components: runner-py-interactive >Reporter: Ning Kang >Assignee: Ning Kang >Priority: Major > Time Spent: 13h 50m > Remaining Estimate: 0h > > Support auto plotting / charting of materialized data of a given PCollection > with Interactive Beam. > Say an Interactive Beam pipeline is defined as > p = create_pipeline() > pcoll = p | 'Transform' >> transform() > The user can call a single function and get auto-magical charting of the data > as materialized pcoll. > e.g., visualize(pcoll) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-7926) Visualize PCollection with Interactive Beam
[ https://issues.apache.org/jira/browse/BEAM-7926?focusedWorklogId=332918=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332918 ] ASF GitHub Bot logged work on BEAM-7926: Author: ASF GitHub Bot Created on: 23/Oct/19 21:49 Start Date: 23/Oct/19 21:49 Worklog Time Spent: 10m Work Description: KevinGG commented on issue #9741: [BEAM-7926] Visualize PCollection URL: https://github.com/apache/beam/pull/9741#issuecomment-545650837 I've added two tox envs and corresponding Py36 and Py37 test suites. Removed [interactive] from test required packages. Added it to the py2:docs env. I'll add [interactive] for other specific envs defined in tox.ini if any of the gradle tasks need it. I've also added checks and warnings in `interactive_environment.InteractiveEnvironment` to check prerequisites for interactive beam (didn't put them in the global scope because global scoped warning level logging will fail gradle tasks in pre-commit): 1. If Python is above 3.5.3; 2. If [interactive] dependencies are fully available; 3. If the current runtime is within an interactive environment, i.e., within an ipython kernel. For the `time.sleep()` and `pcoll_visualization` NOOP strategy when dependencies are not fully installed, please let me know your preference, @aaltay. Personally I'm fine with either approach, and I'll make changes if needed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 332918) Time Spent: 13h 40m (was: 13.5h) > Visualize PCollection with Interactive Beam > --- > > Key: BEAM-7926 > URL: https://issues.apache.org/jira/browse/BEAM-7926 > Project: Beam > Issue Type: New Feature > Components: runner-py-interactive >Reporter: Ning Kang >Assignee: Ning Kang >Priority: Major > Time Spent: 13h 40m > Remaining Estimate: 0h > > Support auto plotting / charting of materialized data of a given PCollection > with Interactive Beam. > Say an Interactive Beam pipeline is defined as > p = create_pipeline() > pcoll = p | 'Transform' >> transform() > The user can call a single function and get auto-magical charting of the data > as materialized pcoll. > e.g., visualize(pcoll) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (BEAM-7842) Add Python 2 deprecation warnings starting from 2.17.0 release.
[ https://issues.apache.org/jira/browse/BEAM-7842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Valentyn Tymofieiev resolved BEAM-7842. --- Fix Version/s: (was: 2.17.0) 2.16.0 Resolution: Fixed > Add Python 2 deprecation warnings starting from 2.17.0 release. > --- > > Key: BEAM-7842 > URL: https://issues.apache.org/jira/browse/BEAM-7842 > Project: Beam > Issue Type: Sub-task > Components: sdk-py-core >Reporter: Valentyn Tymofieiev >Assignee: Valentyn Tymofieiev >Priority: Minor > Fix For: 2.16.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-7842) Add Python 2 deprecation warnings starting from 2.17.0 release.
[ https://issues.apache.org/jira/browse/BEAM-7842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958269#comment-16958269 ] Valentyn Tymofieiev commented on BEAM-7842: --- We added the warnings starting from 2.16.0. > Add Python 2 deprecation warnings starting from 2.17.0 release. > --- > > Key: BEAM-7842 > URL: https://issues.apache.org/jira/browse/BEAM-7842 > Project: Beam > Issue Type: Sub-task > Components: sdk-py-core >Reporter: Valentyn Tymofieiev >Assignee: Valentyn Tymofieiev >Priority: Minor > Fix For: 2.17.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8341) basic bundling support for samza portable runner
[ https://issues.apache.org/jira/browse/BEAM-8341?focusedWorklogId=332913=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332913 ] ASF GitHub Bot logged work on BEAM-8341: Author: ASF GitHub Bot Created on: 23/Oct/19 21:29 Start Date: 23/Oct/19 21:29 Worklog Time Spent: 10m Work Description: xinyuiscool commented on pull request #9777: [BEAM-8341]: basic bundling support for portable runner URL: https://github.com/apache/beam/pull/9777#discussion_r338287932 ## File path: runners/samza/src/main/java/org/apache/beam/runners/samza/translation/ParDoBoundMultiTranslator.java ## @@ -76,11 +79,36 @@ doFnInvokerRegistrar = invokerReg.hasNext() ? Iterators.getOnlyElement(invokerReg) : null; } + /* + * Perform some common pipeline option validation. Bundling related logic for now + */ + private static void validatePipelineOption( Review comment: We actually have a SamzaPipelineOptionsValidator. Can we move the logic there? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 332913) Time Spent: 1h 50m (was: 1h 40m) > basic bundling support for samza portable runner > > > Key: BEAM-8341 > URL: https://issues.apache.org/jira/browse/BEAM-8341 > Project: Beam > Issue Type: Task > Components: runner-samza >Reporter: Hai Lu >Assignee: Hai Lu >Priority: Major > Time Spent: 1h 50m > Remaining Estimate: 0h > > bundling support for samza portable runner -- This message was sent by Atlassian Jira (v8.3.4#803005)
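Moving the bundling checks into the existing `SamzaPipelineOptionsValidator`, as the reviewer suggests, would centralize cross-option invariants in one place instead of scattering them across translators. A generic sketch of the pattern; the option names `maxBundleSize` and `maxBundleTimeMs` are assumptions for illustration, not necessarily the runner's actual options:

```java
public class OptionsValidatorSketch {
    // Stand-in for the runner's pipeline options.
    static class Options {
        long maxBundleSize = 1;
        long maxBundleTimeMs = 0;
    }

    // One central place for cross-option invariants: fail fast at
    // submission time rather than deep inside a translator.
    static void validate(Options opts) {
        if (opts.maxBundleSize > 1 && opts.maxBundleTimeMs <= 0) {
            throw new IllegalArgumentException(
                "bundling (maxBundleSize > 1) requires a positive maxBundleTimeMs");
        }
    }

    public static void main(String[] args) {
        Options opts = new Options();
        opts.maxBundleSize = 1; // no bundling configured: nothing to check
        validate(opts);
        System.out.println("options valid"); // prints: options valid
    }
}
```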
[jira] [Work logged] (BEAM-8446) apache_beam.io.gcp.bigquery_write_it_test.BigQueryWriteIntegrationTests.test_big_query_write_new_types is flaky
[ https://issues.apache.org/jira/browse/BEAM-8446?focusedWorklogId=332905=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332905 ] ASF GitHub Bot logged work on BEAM-8446: Author: ASF GitHub Bot Created on: 23/Oct/19 21:22 Start Date: 23/Oct/19 21:22 Worklog Time Spent: 10m Work Description: pabloem commented on issue #9855: [BEAM-8446] Retrying BQ query on timeouts URL: https://github.com/apache/beam/pull/9855#issuecomment-545642130 Run Python 3.7 PostCommit This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 332905) Time Spent: 2h 50m (was: 2h 40m) > apache_beam.io.gcp.bigquery_write_it_test.BigQueryWriteIntegrationTests.test_big_query_write_new_types > is flaky > --- > > Key: BEAM-8446 > URL: https://issues.apache.org/jira/browse/BEAM-8446 > Project: Beam > Issue Type: New Feature > Components: test-failures >Reporter: Boyuan Zhang >Assignee: Pablo Estrada >Priority: Major > Time Spent: 2h 50m > Remaining Estimate: 0h > > test_big_query_write_new_types appears to be flaky in > beam_PostCommit_Python37 test suite. > https://builds.apache.org/job/beam_PostCommit_Python37/733/ > https://builds.apache.org/job/beam_PostCommit_Python37/739/ -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-7926) Visualize PCollection with Interactive Beam
[ https://issues.apache.org/jira/browse/BEAM-7926?focusedWorklogId=332904=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332904 ] ASF GitHub Bot logged work on BEAM-7926: Author: ASF GitHub Bot Created on: 23/Oct/19 21:21 Start Date: 23/Oct/19 21:21 Worklog Time Spent: 10m Work Description: pabloem commented on issue #9771: [BEAM-7926] Update dependencies in Java Katas URL: https://github.com/apache/beam/pull/9771#issuecomment-545641711 Alright. LGTM. I'll just ping @henryken once more, and merge in a couple days if he doesn't answer. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 332904) Time Spent: 13.5h (was: 13h 20m) > Visualize PCollection with Interactive Beam > --- > > Key: BEAM-7926 > URL: https://issues.apache.org/jira/browse/BEAM-7926 > Project: Beam > Issue Type: New Feature > Components: runner-py-interactive >Reporter: Ning Kang >Assignee: Ning Kang >Priority: Major > Time Spent: 13.5h > Remaining Estimate: 0h > > Support auto plotting / charting of materialized data of a given PCollection > with Interactive Beam. > Say an Interactive Beam pipeline is defined as > p = create_pipeline() > pcoll = p | 'Transform' >> transform() > The user can call a single function and get auto-magical charting of the data > as materialized pcoll. > e.g., visualize(pcoll) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-4775) JobService should support returning metrics
[ https://issues.apache.org/jira/browse/BEAM-4775?focusedWorklogId=332896=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332896 ] ASF GitHub Bot logged work on BEAM-4775: Author: ASF GitHub Bot Created on: 23/Oct/19 21:12 Start Date: 23/Oct/19 21:12 Worklog Time Spent: 10m Work Description: ajamato commented on pull request #9843: [BEAM-4775] Converting MonitoringInfos to MetricResults in PortableRunner URL: https://github.com/apache/beam/pull/9843#discussion_r338280812 ## File path: sdks/python/apache_beam/testing/load_tests/load_test.py ## @@ -52,32 +52,24 @@ def setUp(self): self.input_options = json.loads(self.pipeline.get_option('input_options')) self.project_id = self.pipeline.get_option('project') -self.publish_to_big_query = self.pipeline.get_option('publish_to_big_query') self.metrics_dataset = self.pipeline.get_option('metrics_dataset') self.metrics_namespace = self.pipeline.get_option('metrics_table') -if not self.are_metrics_collected(): - logging.info('Metrics will not be collected') - self.metrics_monitor = None -else: - self.metrics_monitor = MetricsReader( - project_name=self.project_id, - bq_table=self.metrics_namespace, - bq_dataset=self.metrics_dataset, - ) +self.metrics_monitor = MetricsReader( Review comment: I am not familiar with MetricsReader This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 332896) Time Spent: 51h 20m (was: 51h 10m) > JobService should support returning metrics > --- > > Key: BEAM-4775 > URL: https://issues.apache.org/jira/browse/BEAM-4775 > Project: Beam > Issue Type: Bug > Components: beam-model >Reporter: Eugene Kirpichov >Assignee: Kamil Wasilewski >Priority: Major > Time Spent: 51h 20m > Remaining Estimate: 0h > > Design doc: [https://s.apache.org/get-metrics-api]. > Further discussion is ongoing on [this > doc|https://docs.google.com/document/d/1m83TsFvJbOlcLfXVXprQm1B7vUakhbLZMzuRrOHWnTg/edit?ts=5c826bb4#heading=h.faqan9rjc6dm]. > We want to report job metrics back to the portability harness from the runner > harness, for displaying to users. > h1. Relevant PRs in flight: > h2. Ready for Review: > * [#8022|https://github.com/apache/beam/pull/8022]: correct the Job RPC > protos from [#8018|https://github.com/apache/beam/pull/8018]. > h2. Iterating / Discussing: > * [#7971|https://github.com/apache/beam/pull/7971]: Flink portable metrics: > get ptransform from MonitoringInfo, not stage name > ** this is a simpler, Flink-specific PR that is basically duplicated inside > each of the following two, so may be worth trying to merge in first > * #[7915|https://github.com/apache/beam/pull/7915]: use MonitoringInfo data > model in Java SDK metrics > * [#7868|https://github.com/apache/beam/pull/7868]: MonitoringInfo URN tweaks > h2. 
Merged > * [#8018|https://github.com/apache/beam/pull/8018]: add job metrics RPC > protos > * [#7867|https://github.com/apache/beam/pull/7867]: key MetricResult by a > MetricKey > * [#7938|https://github.com/apache/beam/pull/7938]: move MonitoringInfo > protos to model/pipeline module > * [#7883|https://github.com/apache/beam/pull/7883]: Add > MetricQueryResults.allMetrics() helper > * [#7866|https://github.com/apache/beam/pull/7866]: move function helpers > from fn-harness to sdks/java/core > * [#7890|https://github.com/apache/beam/pull/7890]: consolidate MetricResult > implementations > h2. Closed > * [#7934|https://github.com/apache/beam/pull/7934]: job metrics RPC + SDK > support > * [#7876|https://github.com/apache/beam/pull/7876]: Clean up metric protos; > support integer distributions, gauges -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-4775) JobService should support returning metrics
[ https://issues.apache.org/jira/browse/BEAM-4775?focusedWorklogId=332895=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332895 ] ASF GitHub Bot logged work on BEAM-4775: Author: ASF GitHub Bot Created on: 23/Oct/19 21:12 Start Date: 23/Oct/19 21:12 Worklog Time Spent: 10m Work Description: ajamato commented on pull request #9843: [BEAM-4775] Converting MonitoringInfos to MetricResults in PortableRunner URL: https://github.com/apache/beam/pull/9843#discussion_r33828 ## File path: sdks/python/apache_beam/runners/portability/portable_runner.py ## @@ -361,16 +363,31 @@ def add_runner_options(parser): state_stream, cleanup_callbacks) -class PortableMetrics(metrics.metric.MetricResults): +class PortableMetrics(metric.MetricResults, + portable_metrics.ParseMonitoringInfoMixin): Review comment: Please remove the use of double inheritance. You can use the methods from portable_metrics.ParseMonitoringInfoMixin by making that file define the methods in the module instead of a class, since it only defines static methods. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 332895) Time Spent: 51h 10m (was: 51h)
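The refactor the reviewer suggests above (drop the static-method mixin and expose plain module-level functions from portable_metrics.py) can be sketched as follows. This is a minimal, self-contained illustration only: a namedtuple stands in for Beam's MonitoringInfo proto, and the real implementation would use the apache_beam.metrics.monitoring_infos helpers to classify each entry.

```python
# Hedged sketch of the suggested refactor: module-level functions instead of
# a ParseMonitoringInfoMixin class. MonitoringInfo is stood in by a
# namedtuple; the real code would inspect Beam's MonitoringInfo protos.
from collections import namedtuple

MonitoringInfo = namedtuple("MonitoringInfo", ["kind", "key", "value", "is_user"])


def from_monitoring_infos(monitoring_info_list, user_metrics_only=False):
    """Groups MonitoringInfo objects into counters, distributions and gauges."""
    counters, distributions, gauges = {}, {}, {}
    for mi in monitoring_info_list:
        if user_metrics_only and not mi.is_user:
            continue  # skip system metrics when only user metrics are wanted
        if mi.kind == "counter":
            counters[mi.key] = mi.value
        elif mi.kind == "distribution":
            distributions[mi.key] = mi.value
        elif mi.kind == "gauge":
            gauges[mi.key] = mi.value
    return counters, distributions, gauges


# A caller such as PortableMetrics then delegates to the module function
# instead of inheriting from a mixin:
infos = [
    MonitoringInfo("counter", "Step/user_elements", 7, True),
    MonitoringInfo("counter", "Step/system_bytes", 4096, False),
    MonitoringInfo("gauge", "Step/user_watermark", 3, True),
]
counters, distributions, gauges = from_monitoring_infos(infos, user_metrics_only=True)
print(counters)  # {'Step/user_elements': 7}
```

Since every method on the mixin was static, nothing is lost by moving them to module scope, and the double inheritance in PortableMetrics disappears.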
[jira] [Work logged] (BEAM-4775) JobService should support returning metrics
[ https://issues.apache.org/jira/browse/BEAM-4775?focusedWorklogId=332894=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332894 ] ASF GitHub Bot logged work on BEAM-4775: Author: ASF GitHub Bot Created on: 23/Oct/19 21:12 Start Date: 23/Oct/19 21:12 Worklog Time Spent: 10m Work Description: ajamato commented on pull request #9843: [BEAM-4775] Converting MonitoringInfos to MetricResults in PortableRunner URL: https://github.com/apache/beam/pull/9843#discussion_r338278403 ## File path: sdks/python/apache_beam/runners/portability/portable_metrics.py ## @@ -0,0 +1,74 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +from __future__ import absolute_import + +from apache_beam.metrics import monitoring_infos +from apache_beam.metrics.execution import MetricKey +from apache_beam.metrics.metric import MetricName + + +class ParseMonitoringInfoMixin(object): + @staticmethod + def from_monitoring_infos(monitoring_info_list, user_metrics_only=False): +"""Groups MonitoringInfo objects into counters, distributions and gauges. + +Args: + monitoring_info_list: An iterable of MonitoringInfo objects. + user_metrics_only: If true, includes user metrics only. 
Review comment: Please add a Returns section to pydoc comment This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 332894) Time Spent: 51h (was: 50h 50m) > JobService should support returning metrics > --- > > Key: BEAM-4775 > URL: https://issues.apache.org/jira/browse/BEAM-4775 > Project: Beam > Issue Type: Bug > Components: beam-model >Reporter: Eugene Kirpichov >Assignee: Kamil Wasilewski >Priority: Major > Time Spent: 51h > Remaining Estimate: 0h > > Design doc: [https://s.apache.org/get-metrics-api]. > Further discussion is ongoing on [this > doc|https://docs.google.com/document/d/1m83TsFvJbOlcLfXVXprQm1B7vUakhbLZMzuRrOHWnTg/edit?ts=5c826bb4#heading=h.faqan9rjc6dm]. > We want to report job metrics back to the portability harness from the runner > harness, for displaying to users. > h1. Relevant PRs in flight: > h2. Ready for Review: > * [#8022|https://github.com/apache/beam/pull/8022]: correct the Job RPC > protos from [#8018|https://github.com/apache/beam/pull/8018]. > h2. Iterating / Discussing: > * [#7971|https://github.com/apache/beam/pull/7971]: Flink portable metrics: > get ptransform from MonitoringInfo, not stage name > ** this is a simpler, Flink-specific PR that is basically duplicated inside > each of the following two, so may be worth trying to merge in first > * #[7915|https://github.com/apache/beam/pull/7915]: use MonitoringInfo data > model in Java SDK metrics > * [#7868|https://github.com/apache/beam/pull/7868]: MonitoringInfo URN tweaks > h2. 
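Concretely, the Returns section the reviewer asks for could look like the hypothetical docstring sketch below; the 3-tuple shape matches the code quoted in the diff, and the body is elided since only the docstring is being illustrated.

```python
def from_monitoring_infos(monitoring_info_list, user_metrics_only=False):
    """Groups MonitoringInfo objects into counters, distributions and gauges.

    Args:
      monitoring_info_list: An iterable of MonitoringInfo objects.
      user_metrics_only: If true, includes user metrics only.

    Returns:
      A 3-tuple (counters, distributions, gauges), where each element is a
      dict mapping a MetricKey to the metric value extracted from the
      corresponding MonitoringInfo.
    """
    # Body elided; only the docstring shape is illustrated here.
    return {}, {}, {}
```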
[jira] [Work logged] (BEAM-4775) JobService should support returning metrics
[ https://issues.apache.org/jira/browse/BEAM-4775?focusedWorklogId=332898=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332898 ] ASF GitHub Bot logged work on BEAM-4775: Author: ASF GitHub Bot Created on: 23/Oct/19 21:12 Start Date: 23/Oct/19 21:12 Worklog Time Spent: 10m Work Description: ajamato commented on pull request #9843: [BEAM-4775] Converting MonitoringInfos to MetricResults in PortableRunner URL: https://github.com/apache/beam/pull/9843#discussion_r338279475 ## File path: sdks/python/apache_beam/runners/portability/portable_metrics.py ## @@ -0,0 +1,74 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +from __future__ import absolute_import + +from apache_beam.metrics import monitoring_infos +from apache_beam.metrics.execution import MetricKey +from apache_beam.metrics.metric import MetricName + + +class ParseMonitoringInfoMixin(object): + @staticmethod + def from_monitoring_infos(monitoring_info_list, user_metrics_only=False): +"""Groups MonitoringInfo objects into counters, distributions and gauges. + +Args: + monitoring_info_list: An iterable of MonitoringInfo objects. + user_metrics_only: If true, includes user metrics only. 
+""" +counters = {} +distributions = {} +gauges = {} + +for mi in monitoring_info_list: + if (user_metrics_only and + not monitoring_infos.is_user_monitoring_info(mi)): +continue + + key = ParseMonitoringInfoMixin._create_metric_key(mi) + metric_result = (monitoring_infos.extract_metric_result_map_value(mi)) + + if monitoring_infos.is_counter(mi): +counters[key] = metric_result + elif monitoring_infos.is_distribution(mi): +distributions[key] = metric_result + elif monitoring_infos.is_gauge(mi): +gauges[key] = metric_result + +return counters, distributions, gauges + + @staticmethod + def _create_metric_key(monitoring_info): +step_name = ParseMonitoringInfoMixin._get_step_name(monitoring_info) +namespace, name = monitoring_infos.parse_namespace_and_name(monitoring_info) +return MetricKey(step_name, MetricName(namespace, name)) + + @staticmethod + def _get_step_name(monitoring_info): +keys_to_check = [monitoring_infos.PTRANSFORM_LABEL, Review comment: This doesn't seem correct. Why would you consider the step name to be under labels other than PTRANSFORM_LABEL see montioring_info specs defined here https://github.com/apache/beam/blob/d4afbabf38a3ab557625c9c091ed5f06ca6731ce/model/pipeline/src/main/proto/metrics.proto PTRANSFORM_LABEL is the only one used for this purpose This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 332898) Time Spent: 51.5h (was: 51h 20m) > JobService should support returning metrics > --- > > Key: BEAM-4775 > URL: https://issues.apache.org/jira/browse/BEAM-4775 > Project: Beam > Issue Type: Bug > Components: beam-model >Reporter: Eugene Kirpichov >Assignee: Kamil Wasilewski >Priority: Major > Time Spent: 51.5h > Remaining Estimate: 0h > > Design doc: [https://s.apache.org/get-metrics-api]. 
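The reviewer's point above is that metrics.proto designates PTRANSFORM as the only label carrying the step name, so the lookup needs no fallback list of keys. A minimal sketch, with the label constant and a plain dict standing in for the real proto labels (the actual constant lives in apache_beam.metrics.monitoring_infos):

```python
# Stand-in for monitoring_infos.PTRANSFORM_LABEL; assumed value for
# illustration only.
PTRANSFORM_LABEL = "PTRANSFORM"


def get_step_name(labels):
    """Returns the step name from a MonitoringInfo's labels, or None.

    Per the metrics.proto spec referenced in the review, PTRANSFORM is the
    only label that carries the step name, so no other keys are consulted.
    """
    return labels.get(PTRANSFORM_LABEL)


print(get_step_name({"PTRANSFORM": "pardo-step"}))  # pardo-step
print(get_step_name({"NAMESPACE": "user"}))         # None
```

Returning None for infos without a PTRANSFORM label lets the caller decide whether such metrics should be dropped or reported without a step.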
[jira] [Work logged] (BEAM-4775) JobService should support returning metrics
[ https://issues.apache.org/jira/browse/BEAM-4775?focusedWorklogId=332899=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332899 ] ASF GitHub Bot logged work on BEAM-4775: Author: ASF GitHub Bot Created on: 23/Oct/19 21:12 Start Date: 23/Oct/19 21:12 Worklog Time Spent: 10m Work Description: ajamato commented on pull request #9843: [BEAM-4775] Converting MonitoringInfos to MetricResults in PortableRunner URL: https://github.com/apache/beam/pull/9843#discussion_r338280472 ## File path: sdks/python/apache_beam/runners/portability/portable_metrics.py ## @@ -0,0 +1,74 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +from __future__ import absolute_import + +from apache_beam.metrics import monitoring_infos +from apache_beam.metrics.execution import MetricKey +from apache_beam.metrics.metric import MetricName + + +class ParseMonitoringInfoMixin(object): Review comment: Please remove this class and instead define the methods on a module, without a class. All the methods here are static, so there is no need for a class/object instance, This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 332899) > JobService should support returning metrics > --- > > Key: BEAM-4775 > URL: https://issues.apache.org/jira/browse/BEAM-4775 > Project: Beam > Issue Type: Bug > Components: beam-model >Reporter: Eugene Kirpichov >Assignee: Kamil Wasilewski >Priority: Major > Time Spent: 51.5h > Remaining Estimate: 0h > > Design doc: [https://s.apache.org/get-metrics-api]. > Further discussion is ongoing on [this > doc|https://docs.google.com/document/d/1m83TsFvJbOlcLfXVXprQm1B7vUakhbLZMzuRrOHWnTg/edit?ts=5c826bb4#heading=h.faqan9rjc6dm]. > We want to report job metrics back to the portability harness from the runner > harness, for displaying to users. > h1. Relevant PRs in flight: > h2. Ready for Review: > * [#8022|https://github.com/apache/beam/pull/8022]: correct the Job RPC > protos from [#8018|https://github.com/apache/beam/pull/8018]. > h2. Iterating / Discussing: > * [#7971|https://github.com/apache/beam/pull/7971]: Flink portable metrics: > get ptransform from MonitoringInfo, not stage name > ** this is a simpler, Flink-specific PR that is basically duplicated inside > each of the following two, so may be worth trying to merge in first > * #[7915|https://github.com/apache/beam/pull/7915]: use MonitoringInfo data > model in Java SDK metrics > * [#7868|https://github.com/apache/beam/pull/7868]: MonitoringInfo URN tweaks > h2. 
[jira] [Work logged] (BEAM-4775) JobService should support returning metrics
[ https://issues.apache.org/jira/browse/BEAM-4775?focusedWorklogId=332897=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332897 ] ASF GitHub Bot logged work on BEAM-4775: Author: ASF GitHub Bot Created on: 23/Oct/19 21:12 Start Date: 23/Oct/19 21:12 Worklog Time Spent: 10m Work Description: ajamato commented on pull request #9843: [BEAM-4775] Converting MonitoringInfos to MetricResults in PortableRunner URL: https://github.com/apache/beam/pull/9843#discussion_r338278677 ## File path: sdks/python/apache_beam/runners/portability/portable_metrics.py ## @@ -0,0 +1,74 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +from __future__ import absolute_import + +from apache_beam.metrics import monitoring_infos +from apache_beam.metrics.execution import MetricKey +from apache_beam.metrics.metric import MetricName + + +class ParseMonitoringInfoMixin(object): + @staticmethod + def from_monitoring_infos(monitoring_info_list, user_metrics_only=False): +"""Groups MonitoringInfo objects into counters, distributions and gauges. + +Args: + monitoring_info_list: An iterable of MonitoringInfo objects. + user_metrics_only: If true, includes user metrics only. 
+""" +counters = {} +distributions = {} +gauges = {} + +for mi in monitoring_info_list: + if (user_metrics_only and + not monitoring_infos.is_user_monitoring_info(mi)): +continue + + key = ParseMonitoringInfoMixin._create_metric_key(mi) + metric_result = (monitoring_infos.extract_metric_result_map_value(mi)) + + if monitoring_infos.is_counter(mi): +counters[key] = metric_result + elif monitoring_infos.is_distribution(mi): +distributions[key] = metric_result + elif monitoring_infos.is_gauge(mi): +gauges[key] = metric_result + +return counters, distributions, gauges + + @staticmethod + def _create_metric_key(monitoring_info): +step_name = ParseMonitoringInfoMixin._get_step_name(monitoring_info) Review comment: _get_step_name seems like it should live on monitoring_infos.py This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 332897) Time Spent: 51.5h (was: 51h 20m) > JobService should support returning metrics > --- > > Key: BEAM-4775 > URL: https://issues.apache.org/jira/browse/BEAM-4775 > Project: Beam > Issue Type: Bug > Components: beam-model >Reporter: Eugene Kirpichov >Assignee: Kamil Wasilewski >Priority: Major > Time Spent: 51.5h > Remaining Estimate: 0h > > Design doc: [https://s.apache.org/get-metrics-api]. > Further discussion is ongoing on [this > doc|https://docs.google.com/document/d/1m83TsFvJbOlcLfXVXprQm1B7vUakhbLZMzuRrOHWnTg/edit?ts=5c826bb4#heading=h.faqan9rjc6dm]. > We want to report job metrics back to the portability harness from the runner > harness, for displaying to users. > h1. Relevant PRs in flight: > h2. Ready for Review: > * [#8022|https://github.com/apache/beam/pull/8022]: correct the Job RPC > protos from [#8018|https://github.com/apache/beam/pull/8018]. > h2. 
[jira] [Assigned] (BEAM-8459) Samza runner fails UsesStrictTimerOrdering category tests
[ https://issues.apache.org/jira/browse/BEAM-8459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinyu Liu reassigned BEAM-8459: --- Assignee: Xinyu Liu > Samza runner fails UsesStrictTimerOrdering category tests > - > > Key: BEAM-8459 > URL: https://issues.apache.org/jira/browse/BEAM-8459 > Project: Beam > Issue Type: Bug > Components: runner-samza >Affects Versions: 2.17.0 >Reporter: Jan Lukavský >Assignee: Xinyu Liu >Priority: Major > > BEAM-7520 introduced a new set of validatesRunner tests that verify that timers > are fired exactly in order of increasing timestamp. The Samza runner fails these > added tests (they are currently ignored). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-4775) JobService should support returning metrics
[ https://issues.apache.org/jira/browse/BEAM-4775?focusedWorklogId=332884=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332884 ] ASF GitHub Bot logged work on BEAM-4775: Author: ASF GitHub Bot Created on: 23/Oct/19 21:01 Start Date: 23/Oct/19 21:01 Worklog Time Spent: 10m Work Description: ajamato commented on pull request #9843: [BEAM-4775] Converting MonitoringInfos to MetricResults in PortableRunner URL: https://github.com/apache/beam/pull/9843#discussion_r338276288 ## File path: sdks/python/apache_beam/runners/portability/local_job_service.py ## @@ -107,6 +108,26 @@ def stop(self, timeout=1): if os.path.exists(self._staging_dir) and self._cleanup_staging_dir: shutil.rmtree(self._staging_dir, ignore_errors=True) + def GetJobMetrics(self, request, context=None): +if request.job_id not in self._jobs: + raise LookupError("Job {} does not exist".format(request.job_id)) + +result = self._jobs[request.job_id].result +monitoring_info_list = [] +for mi in result._monitoring_infos_by_stage.values(): + monitoring_info_list.extend(mi) + +# Filter out system metrics +user_monitoring_info_list = [ +x for x in monitoring_info_list +if monitoring_infos._is_user_monitoring_info(x) or +monitoring_infos._is_user_distribution_monitoring_info(x) +] + +return beam_job_api_pb2.GetJobMetricsResponse( Review comment: FYI, here is some background. I was not aware that MetricResult existed in proto form now. Creating a proto was suggested but not pursued at the time. https://s.apache.org/get-metrics-api When I last worked here MetricResult did not have a proto format. The plan was to use MonitoringInfos for a language agnostic format, and then MetricResult would be a language specific format (python, java, go, etc.) 
Each runner should provide a way to return MonitoringInfos, and each language would have a library to convert MonitoringInfos to MetricResult protos. It's hard for me to review; I don't think I am up to date on whatever the current plan/usage of these protos is. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 332884) Time Spent: 50h 40m (was: 50.5h)
[jira] [Work logged] (BEAM-4775) JobService should support returning metrics
[ https://issues.apache.org/jira/browse/BEAM-4775?focusedWorklogId=332886=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332886 ] ASF GitHub Bot logged work on BEAM-4775: Author: ASF GitHub Bot Created on: 23/Oct/19 21:01 Start Date: 23/Oct/19 21:01 Worklog Time Spent: 10m Work Description: ajamato commented on pull request #9843: [BEAM-4775] Converting MonitoringInfos to MetricResults in PortableRunner URL: https://github.com/apache/beam/pull/9843#discussion_r338276288 ## File path: sdks/python/apache_beam/runners/portability/local_job_service.py ## @@ -107,6 +108,26 @@ def stop(self, timeout=1): if os.path.exists(self._staging_dir) and self._cleanup_staging_dir: shutil.rmtree(self._staging_dir, ignore_errors=True) + def GetJobMetrics(self, request, context=None): +if request.job_id not in self._jobs: + raise LookupError("Job {} does not exist".format(request.job_id)) + +result = self._jobs[request.job_id].result +monitoring_info_list = [] +for mi in result._monitoring_infos_by_stage.values(): + monitoring_info_list.extend(mi) + +# Filter out system metrics +user_monitoring_info_list = [ +x for x in monitoring_info_list +if monitoring_infos._is_user_monitoring_info(x) or +monitoring_infos._is_user_distribution_monitoring_info(x) +] + +return beam_job_api_pb2.GetJobMetricsResponse( Review comment: FYI, here is some background. I was not aware that MetricResult existed in proto form now. Creating a proto was suggested but not pursued at the time. https://s.apache.org/get-metrics-api When I last worked here MetricResult did not have a proto format. The plan was to use MonitoringInfos for a language agnostic format, and then MetricResult would be a language specific format (python, java, go, etc.) 
Each runner should provide a way to return MonitoringInfos, and each language would have a library to convert MonitoringInfos to MetricResult protos. It seems like you might be using a different approach: using MetricResult protos as the language-agnostic solution. It's hard for me to review; I don't think I'm up to date on the current plan/usage of these protos. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 332886) Time Spent: 50h 50m (was: 50h 40m) > JobService should support returning metrics > --- > > Key: BEAM-4775 > URL: https://issues.apache.org/jira/browse/BEAM-4775 > Project: Beam > Issue Type: Bug > Components: beam-model >Reporter: Eugene Kirpichov >Assignee: Kamil Wasilewski >Priority: Major > Time Spent: 50h 50m > Remaining Estimate: 0h > > Design doc: [https://s.apache.org/get-metrics-api]. > Further discussion is ongoing on [this > doc|https://docs.google.com/document/d/1m83TsFvJbOlcLfXVXprQm1B7vUakhbLZMzuRrOHWnTg/edit?ts=5c826bb4#heading=h.faqan9rjc6dm]. > We want to report job metrics back to the portability harness from the runner > harness, for displaying to users. > h1. Relevant PRs in flight: > h2. Ready for Review: > * [#8022|https://github.com/apache/beam/pull/8022]: correct the Job RPC > protos from [#8018|https://github.com/apache/beam/pull/8018]. > h2. 
Iterating / Discussing: > * [#7971|https://github.com/apache/beam/pull/7971]: Flink portable metrics: > get ptransform from MonitoringInfo, not stage name > ** this is a simpler, Flink-specific PR that is basically duplicated inside > each of the following two, so may be worth trying to merge in first > * #[7915|https://github.com/apache/beam/pull/7915]: use MonitoringInfo data > model in Java SDK metrics > * [#7868|https://github.com/apache/beam/pull/7868]: MonitoringInfo URN tweaks > h2. Merged > * [#8018|https://github.com/apache/beam/pull/8018]: add job metrics RPC > protos > * [#7867|https://github.com/apache/beam/pull/7867]: key MetricResult by a > MetricKey > * [#7938|https://github.com/apache/beam/pull/7938]: move MonitoringInfo > protos to model/pipeline module > * [#7883|https://github.com/apache/beam/pull/7883]: Add > MetricQueryResults.allMetrics() helper > * [#7866|https://github.com/apache/beam/pull/7866]: move function helpers > from fn-harness to sdks/java/core > * [#7890|https://github.com/apache/beam/pull/7890]: consolidate MetricResult > implementations > h2. Closed > * [#7934|https://github.com/apache/beam/pull/7934]: job metrics
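The GetJobMetrics snippet quoted in this thread flattens the per-stage MonitoringInfo lists and then keeps only user-defined metrics. That filtering step can be sketched standalone as follows; note the `MonitoringInfo` class and the `beam:metric:user` URN prefix used here are illustrative assumptions for the sketch, not Beam's exact internal types or URNs.

```python
# Illustrative sketch of the user-metric filtering done in GetJobMetrics.
# MonitoringInfo and the URN prefix below are assumptions, not Beam internals.
from dataclasses import dataclass
from typing import Dict, List

USER_METRIC_URN_PREFIX = "beam:metric:user"  # assumed user-metric URN prefix


@dataclass
class MonitoringInfo:
    urn: str
    payload: int


def is_user_monitoring_info(mi: MonitoringInfo) -> bool:
    """Keep user-defined metrics; drop runner/system metrics."""
    return mi.urn.startswith(USER_METRIC_URN_PREFIX)


def collect_user_metrics(
        infos_by_stage: Dict[str, List[MonitoringInfo]]) -> List[MonitoringInfo]:
    """Flatten the per-stage lists, then filter, as GetJobMetrics does."""
    flat: List[MonitoringInfo] = []
    for stage_infos in infos_by_stage.values():
        flat.extend(stage_infos)
    return [mi for mi in flat if is_user_monitoring_info(mi)]
```

Filtering by URN prefix is what makes the response contain only metrics the pipeline author created, rather than every counter the runner emits.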
[jira] [Work logged] (BEAM-8335) Add streaming support to Interactive Beam
[ https://issues.apache.org/jira/browse/BEAM-8335?focusedWorklogId=332879=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332879 ] ASF GitHub Bot logged work on BEAM-8335: Author: ASF GitHub Bot Created on: 23/Oct/19 20:52 Start Date: 23/Oct/19 20:52 Worklog Time Spent: 10m Work Description: rohdesamuel commented on issue #9720: [BEAM-8335] Add initial modules for interactive streaming support URL: https://github.com/apache/beam/pull/9720#issuecomment-545631086 I discussed with Robert the necessity of the GRPC service. In that discussion, we agreed that the easiest way to support an interactive streaming job is through an HTTP service. Unfortunately, using HTTP transcoding with GRPC requires having a copy of certain protos in the repo to be used during proto compilation. Thus, I will make a separate HTTP server without using the GRPC service definition and remove all InteractiveService rpcs _except_ for the Events request. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 332879) Time Spent: 11h (was: 10h 50m) > Add streaming support to Interactive Beam > - > > Key: BEAM-8335 > URL: https://issues.apache.org/jira/browse/BEAM-8335 > Project: Beam > Issue Type: Improvement > Components: runner-py-interactive >Reporter: Sam Rohde >Assignee: Sam Rohde >Priority: Major > Time Spent: 11h > Remaining Estimate: 0h > > This issue tracks the work items to introduce streaming support to the > Interactive Beam experience. 
This will allow users to: > * Write and run a streaming job in IPython > * Automatically cache records from unbounded sources > * Add a replay experience that replays all cached records to simulate the > original pipeline execution > * Add controls to play/pause/stop/step individual elements from the cached > records > * Add ability to inspect/visualize unbounded PCollections -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8468) Add predicate/filter push-down capability to IO APIs
[ https://issues.apache.org/jira/browse/BEAM-8468?focusedWorklogId=332869=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332869 ] ASF GitHub Bot logged work on BEAM-8468: Author: ASF GitHub Bot Created on: 23/Oct/19 20:46 Start Date: 23/Oct/19 20:46 Worklog Time Spent: 10m Work Description: 11moon11 commented on pull request #9863: [BEAM-8468] Predicate push down for in memory table URL: https://github.com/apache/beam/pull/9863 - Create a class `TestTableFilter implements BeamSqlTableFilter` - Update `buildIOReader(PBegin begin, BeamSqlTableFilter filters, List<String> fieldNames)` method - Add `BeamSqlTableFilter constructFilter(List<RexNode> filter)` method - Update push-down rule Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily: - [ ] [**Choose reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and mention them in a comment (`R: @username`). - [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue. - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf). See the [Contributor Guide](https://beam.apache.org/contribute) for more tips on [how to make review process smoother](https://beam.apache.org/contribute/#make-reviewers-job-easier). 
[jira] [Work logged] (BEAM-8468) Add predicate/filter push-down capability to IO APIs
[ https://issues.apache.org/jira/browse/BEAM-8468?focusedWorklogId=332870=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332870 ] ASF GitHub Bot logged work on BEAM-8468: Author: ASF GitHub Bot Created on: 23/Oct/19 20:46 Start Date: 23/Oct/19 20:46 Worklog Time Spent: 10m Work Description: 11moon11 commented on issue #9863: [BEAM-8468] Predicate push down for in memory table URL: https://github.com/apache/beam/pull/9863#issuecomment-545628821 R: @apilloud cc: @amaliujia This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 332870) Time Spent: 20m (was: 10m) > Add predicate/filter push-down capability to IO APIs > > > Key: BEAM-8468 > URL: https://issues.apache.org/jira/browse/BEAM-8468 > Project: Beam > Issue Type: New Feature > Components: dsl-sql >Reporter: Kirill Kozlov >Assignee: Kirill Kozlov >Priority: Major > Time Spent: 20m > Remaining Estimate: 0h > > * InMemoryTable should implement a following methods: > {code:java} > public PCollection buildIOReader( > PBegin begin, BeamSqlTableFilter filters, List fieldNames); > public BeamSqlTableFilter constructFilter(List filter); > {code} > * Update a push-down rule to support predicate/filter push-down. > * Create a class > {code:java} > class TestTableFilter implements BeamSqlTableFilter{code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-8468) Add predicate/filter push-down capability to IO APIs
[ https://issues.apache.org/jira/browse/BEAM-8468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kirill Kozlov updated BEAM-8468: Description: * InMemoryTable should implement a following methods: {code:java} public PCollection buildIOReader( PBegin begin, BeamSqlTableFilter filters, List fieldNames); public BeamSqlTableFilter constructFilter(List filter); {code} * Update a push-down rule to support predicate/filter push-down. * Create a class {code:java} class TestTableFilter implements BeamSqlTableFilter{code} was: * InMemoryTable should implement a following methods: {code:java} public PCollection buildIOReader( PBegin begin, BeamSqlTableFilter filters, List fieldNames); public BeamSqlTableFilter constructFilter(List filter); {code} * Update a push-down rule to support predicate/filter push-down. * Create a class {code:java} class TestTableFilter implements BeamSqlTableFilter{code} > Add predicate/filter push-down capability to IO APIs > > > Key: BEAM-8468 > URL: https://issues.apache.org/jira/browse/BEAM-8468 > Project: Beam > Issue Type: New Feature > Components: dsl-sql >Reporter: Kirill Kozlov >Assignee: Kirill Kozlov >Priority: Major > > * InMemoryTable should implement a following methods: > {code:java} > public PCollection buildIOReader( > PBegin begin, BeamSqlTableFilter filters, List fieldNames); > public BeamSqlTableFilter constructFilter(List filter); > {code} > * Update a push-down rule to support predicate/filter push-down. > * Create a class > {code:java} > class TestTableFilter implements BeamSqlTableFilter{code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (BEAM-8468) Add predicate/filter push-down capability to IO APIs
Kirill Kozlov created BEAM-8468: --- Summary: Add predicate/filter push-down capability to IO APIs Key: BEAM-8468 URL: https://issues.apache.org/jira/browse/BEAM-8468 Project: Beam Issue Type: New Feature Components: dsl-sql Reporter: Kirill Kozlov Assignee: Kirill Kozlov * InMemoryTable should implement the following methods: {code:java} public PCollection<Row> buildIOReader( PBegin begin, BeamSqlTableFilter filters, List<String> fieldNames); public BeamSqlTableFilter constructFilter(List<RexNode> filter); {code} * Update the push-down rule to support predicate/filter push-down. * Create a class {code:java} class TestTableFilter implements BeamSqlTableFilter{code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
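The intended interplay of the two methods above is that `constructFilter` decides which predicates the table can evaluate itself, and `buildIOReader` then applies those predicates (plus the field projection) during the scan, so filtered rows never reach downstream SQL operators. A conceptual Python sketch of that contract, with simplified stand-in names rather than Beam's actual classes:

```python
# Conceptual sketch of predicate push-down for an in-memory table.
# Class and method names mirror the Java API above but are stand-ins,
# not Beam's real BeamSqlTable/BeamSqlTableFilter types.
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]
Predicate = Callable[[Row], bool]


class InMemoryTable:
    def __init__(self, rows: List[Row]):
        self._rows = rows

    def construct_filter(
            self, predicates: List[Predicate]
    ) -> Tuple[List[Predicate], List[Predicate]]:
        """Split predicates into (pushed-down, remaining).

        An in-memory table can evaluate any predicate during its scan, so
        everything is pushed down; a more limited source would hand the
        unsupported predicates back for the SQL layer to apply afterwards.
        """
        return predicates, []

    def build_io_reader(
            self, pushed: List[Predicate], field_names: List[str]
    ) -> List[Row]:
        """Scan the table, applying pushed-down filters and projection."""
        return [
            {name: row[name] for name in field_names}
            for row in self._rows
            if all(pred(row) for pred in pushed)
        ]
```

The design point is the split returned by `construct_filter`: the planner removes pushed-down predicates from the query plan, trusting the table to enforce them at read time.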
[jira] [Work logged] (BEAM-8457) Instrument Dataflow jobs that are launched from Notebooks
[ https://issues.apache.org/jira/browse/BEAM-8457?focusedWorklogId=332850=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332850 ] ASF GitHub Bot logged work on BEAM-8457: Author: ASF GitHub Bot Created on: 23/Oct/19 20:25 Start Date: 23/Oct/19 20:25 Worklog Time Spent: 10m Work Description: pabloem commented on pull request #9854: [BEAM-8457] Label Dataflow jobs from Notebook URL: https://github.com/apache/beam/pull/9854 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 332850) Time Spent: 1.5h (was: 1h 20m) > Instrument Dataflow jobs that are launched from Notebooks > - > > Key: BEAM-8457 > URL: https://issues.apache.org/jira/browse/BEAM-8457 > Project: Beam > Issue Type: Improvement > Components: runner-py-interactive >Reporter: Ning Kang >Assignee: Ning Kang >Priority: Major > Fix For: 2.17.0 > > Time Spent: 1.5h > Remaining Estimate: 0h > > Dataflow needs the capability to tell how many Dataflow jobs are launched > from the Notebook environment, i.e., the Interactive Runner. > # Change the pipeline.run() API to allow supplying a runner and an options > parameter so that a pipeline initially bundled w/ an interactive runner can > be directly run by other runners from a notebook. > # Implicitly add the necessary source information through user labels when > the user does p.run(runner=DataflowRunner()). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8457) Instrument Dataflow jobs that are launched from Notebooks
[ https://issues.apache.org/jira/browse/BEAM-8457?focusedWorklogId=332851=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332851 ] ASF GitHub Bot logged work on BEAM-8457: Author: ASF GitHub Bot Created on: 23/Oct/19 20:25 Start Date: 23/Oct/19 20:25 Worklog Time Spent: 10m Work Description: pabloem commented on issue #9854: [BEAM-8457] Label Dataflow jobs from Notebook URL: https://github.com/apache/beam/pull/9854#issuecomment-545620626 Thanks Ning This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 332851) Time Spent: 1h 40m (was: 1.5h) > Instrument Dataflow jobs that are launched from Notebooks > - > > Key: BEAM-8457 > URL: https://issues.apache.org/jira/browse/BEAM-8457 > Project: Beam > Issue Type: Improvement > Components: runner-py-interactive >Reporter: Ning Kang >Assignee: Ning Kang >Priority: Major > Fix For: 2.17.0 > > Time Spent: 1h 40m > Remaining Estimate: 0h > > Dataflow needs the capability to tell how many Dataflow jobs are launched > from the Notebook environment, i.e., the Interactive Runner. > # Change the pipeline.run() API to allow supplying a runner and an options > parameter so that a pipeline initially bundled w/ an interactive runner can > be directly run by other runners from a notebook. > # Implicitly add the necessary source information through user labels when > the user does p.run(runner=DataflowRunner()). -- This message was sent by Atlassian Jira (v8.3.4#803005)
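Item 1 of the issue description above, a `run()` call that can override the runner a pipeline was built with, can be sketched as follows. The class names and the `run_pipeline` hook here are hypothetical placeholders to show the precedence logic, not Beam's actual signatures.

```python
# Hypothetical sketch of p.run(runner=...) overriding the construction-time
# runner; StubRunner/run_pipeline are placeholders, not Beam's real classes.
class StubRunner:
    def __init__(self, name: str):
        self.name = name

    def run_pipeline(self, pipeline, options):
        # A real runner would translate and submit the pipeline here.
        return "ran with " + self.name


class Pipeline:
    def __init__(self, runner: StubRunner):
        # e.g. the interactive runner the notebook bundled the pipeline with
        self._runner = runner

    def run(self, runner: StubRunner = None, options=None):
        # An explicit runner argument takes precedence over the runner the
        # pipeline was constructed with; otherwise behavior is unchanged.
        return (runner or self._runner).run_pipeline(self, options)
```

Usage: a pipeline built for the interactive runner still runs interactively via `p.run()`, while `p.run(runner=StubRunner("dataflow"))` submits the same pipeline elsewhere, which is where the source-identifying labels from item 2 would be attached.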
[jira] [Commented] (BEAM-8397) DataflowRunnerTest.test_remote_runner_display_data fails due to infinite recursion during pickling.
[ https://issues.apache.org/jira/browse/BEAM-8397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958213#comment-16958213 ] Valentyn Tymofieiev commented on BEAM-8397: --- Looking closer at the test_remote_runner_display_data test, I have a feeling that the expected values were chosen by whatever values made the test pass. For example, the test expects [1]entries "a_class", "a_time", but does not expect "asubcomponent" [1]https://github.com/apache/beam/blob/3330069291d8168c56c77acfef84c2566af05ec6/sdks/python/apache_beam/runners/dataflow/dataflow_runner_test.py#L273 > DataflowRunnerTest.test_remote_runner_display_data fails due to infinite > recursion during pickling. > --- > > Key: BEAM-8397 > URL: https://issues.apache.org/jira/browse/BEAM-8397 > Project: Beam > Issue Type: Sub-task > Components: sdk-py-core >Reporter: Valentyn Tymofieiev >Assignee: Valentyn Tymofieiev >Priority: Major > > `python ./setup.py test -s > apache_beam.runners.dataflow.dataflow_runner_test.DataflowRunnerTest.test_remote_runner_display_data` > passes. > `tox -e py37-gcp` passes if Beam depends on dill==0.3.0, but fails if Beam > depends on dill==0.3.1.1.`python ./setup.py nosetests --tests > 'apache_beam/runners/dataflow/dataflow_runner_test.py:DataflowRunnerTest.test_remote_runner_display_data` > fails currently if run on master. > The failure indicates infinite recursion during pickling: > {noformat} > test_remote_runner_display_data > (apache_beam.runners.dataflow.dataflow_runner_test.DataflowRunnerTest) ... > Fatal Python error: Cannot recover from stack overflow. 
> Current thread 0x7f9d700ed740 (most recent call first): > File "/usr/lib/python3.7/pickle.py", line 479 in get > File "/usr/lib/python3.7/pickle.py", line 497 in save > File "/usr/lib/python3.7/pickle.py", line 786 in save_tuple > File "/usr/lib/python3.7/pickle.py", line 504 in save > File "/usr/lib/python3.7/pickle.py", line 638 in save_reduce > File > "/usr/local/google/home/valentyn/tmp/py37env/lib/python3.7/site-packages/dill/_dill.py", > line 1394 in save_function > File "/usr/lib/python3.7/pickle.py", line 504 in save > File "/usr/lib/python3.7/pickle.py", line 882 in _batch_setitems > File "/usr/lib/python3.7/pickle.py", line 856 in save_dict > File > "/usr/local/google/home/valentyn/tmp/py37env/lib/python3.7/site-packages/dill/_dill.py", > line 910 in save_module_dict > File > "/usr/local/google/home/valentyn/projects/beam/clean/beam/sdks/python/apache_beam/internal/pickler.py", > line 198 in new_save_module_dict > File "/usr/lib/python3.7/pickle.py", line 504 in save > File "/usr/lib/python3.7/pickle.py", line 786 in save_tuple > File "/usr/lib/python3.7/pickle.py", line 504 in save > File "/usr/lib/python3.7/pickle.py", line 638 in save_reduce > File > "/usr/local/google/home/valentyn/projects/beam/clean/beam/sdks/python/apache_beam/internal/pickler.py", > line 114 in wrapper > File "/usr/lib/python3.7/pickle.py", line 504 in save > File "/usr/lib/python3.7/pickle.py", line 771 in save_tuple > File "/usr/lib/python3.7/pickle.py", line 504 in save > File "/usr/lib/python3.7/pickle.py", line 638 in save_reduce > File > "/usr/local/google/home/valentyn/tmp/py37env/lib/python3.7/site-packages/dill/_dill.py", > line 1137 in save_cell > File "/usr/lib/python3.7/pickle.py", line 504 in save > File "/usr/lib/python3.7/pickle.py", line 771 in save_tuple > File "/usr/lib/python3.7/pickle.py", line 504 in save > File "/usr/lib/python3.7/pickle.py", line 786 in save_tuple > File "/usr/lib/python3.7/pickle.py", line 504 in save > File 
"/usr/lib/python3.7/pickle.py", line 638 in save_reduce > File > "/usr/local/google/home/valentyn/tmp/py37env/lib/python3.7/site-packages/dill/_dill.py", > line 1394 in save_function > File "/usr/lib/python3.7/pickle.py", line 504 in save > File "/usr/lib/python3.7/pickle.py", line 882 in _batch_setitems > File "/usr/lib/python3.7/pickle.py", line 856 in save_dict > File > "/usr/local/google/home/valentyn/tmp/py37env/lib/python3.7/site-packages/dill/_dill.py", > line 910 in save_module_dict > File > "/usr/local/google/home/valentyn/projects/beam/clean/beam/sdks/python/apache_beam/internal/pickler.py", > line 198 in new_save_module_dict > ... > {noformat} > cc: [~yoshiki.obata] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (BEAM-8397) DataflowRunnerTest.test_remote_runner_display_data fails due to infinite recursion during pickling.
[ https://issues.apache.org/jira/browse/BEAM-8397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958213#comment-16958213 ] Valentyn Tymofieiev edited comment on BEAM-8397 at 10/23/19 8:12 PM: - Looking closer at the test_remote_runner_display_data test, I have a feeling that the expected values were chosen by whatever values made the test pass. For example, the test expects[1] entries "a_class", "a_time", but does not expect "asubcomponent" [1] https://github.com/apache/beam/blob/3330069291d8168c56c77acfef84c2566af05ec6/sdks/python/apache_beam/runners/dataflow/dataflow_runner_test.py#L273 was (Author: tvalentyn): Looking closer at the test_remote_runner_display_data test, I have a feeling that the expected values were chosen by whatever values made the test pass. For example, the test expects [1]entries "a_class", "a_time", but does not expect "asubcomponent" [1]https://github.com/apache/beam/blob/3330069291d8168c56c77acfef84c2566af05ec6/sdks/python/apache_beam/runners/dataflow/dataflow_runner_test.py#L273 > DataflowRunnerTest.test_remote_runner_display_data fails due to infinite > recursion during pickling. > --- > > Key: BEAM-8397 > URL: https://issues.apache.org/jira/browse/BEAM-8397 > Project: Beam > Issue Type: Sub-task > Components: sdk-py-core >Reporter: Valentyn Tymofieiev >Assignee: Valentyn Tymofieiev >Priority: Major > > `python ./setup.py test -s > apache_beam.runners.dataflow.dataflow_runner_test.DataflowRunnerTest.test_remote_runner_display_data` > passes. > `tox -e py37-gcp` passes if Beam depends on dill==0.3.0, but fails if Beam > depends on dill==0.3.1.1.`python ./setup.py nosetests --tests > 'apache_beam/runners/dataflow/dataflow_runner_test.py:DataflowRunnerTest.test_remote_runner_display_data` > fails currently if run on master. > The failure indicates infinite recursion during pickling: > {noformat} > test_remote_runner_display_data > (apache_beam.runners.dataflow.dataflow_runner_test.DataflowRunnerTest) ... 
> Fatal Python error: Cannot recover from stack overflow. > Current thread 0x7f9d700ed740 (most recent call first): > File "/usr/lib/python3.7/pickle.py", line 479 in get > File "/usr/lib/python3.7/pickle.py", line 497 in save > File "/usr/lib/python3.7/pickle.py", line 786 in save_tuple > File "/usr/lib/python3.7/pickle.py", line 504 in save > File "/usr/lib/python3.7/pickle.py", line 638 in save_reduce > File > "/usr/local/google/home/valentyn/tmp/py37env/lib/python3.7/site-packages/dill/_dill.py", > line 1394 in save_function > File "/usr/lib/python3.7/pickle.py", line 504 in save > File "/usr/lib/python3.7/pickle.py", line 882 in _batch_setitems > File "/usr/lib/python3.7/pickle.py", line 856 in save_dict > File > "/usr/local/google/home/valentyn/tmp/py37env/lib/python3.7/site-packages/dill/_dill.py", > line 910 in save_module_dict > File > "/usr/local/google/home/valentyn/projects/beam/clean/beam/sdks/python/apache_beam/internal/pickler.py", > line 198 in new_save_module_dict > File "/usr/lib/python3.7/pickle.py", line 504 in save > File "/usr/lib/python3.7/pickle.py", line 786 in save_tuple > File "/usr/lib/python3.7/pickle.py", line 504 in save > File "/usr/lib/python3.7/pickle.py", line 638 in save_reduce > File > "/usr/local/google/home/valentyn/projects/beam/clean/beam/sdks/python/apache_beam/internal/pickler.py", > line 114 in wrapper > File "/usr/lib/python3.7/pickle.py", line 504 in save > File "/usr/lib/python3.7/pickle.py", line 771 in save_tuple > File "/usr/lib/python3.7/pickle.py", line 504 in save > File "/usr/lib/python3.7/pickle.py", line 638 in save_reduce > File > "/usr/local/google/home/valentyn/tmp/py37env/lib/python3.7/site-packages/dill/_dill.py", > line 1137 in save_cell > File "/usr/lib/python3.7/pickle.py", line 504 in save > File "/usr/lib/python3.7/pickle.py", line 771 in save_tuple > File "/usr/lib/python3.7/pickle.py", line 504 in save > File "/usr/lib/python3.7/pickle.py", line 786 in save_tuple > File 
"/usr/lib/python3.7/pickle.py", line 504 in save > File "/usr/lib/python3.7/pickle.py", line 638 in save_reduce > File > "/usr/local/google/home/valentyn/tmp/py37env/lib/python3.7/site-packages/dill/_dill.py", > line 1394 in save_function > File "/usr/lib/python3.7/pickle.py", line 504 in save > File "/usr/lib/python3.7/pickle.py", line 882 in _batch_setitems > File "/usr/lib/python3.7/pickle.py", line 856 in save_dict > File > "/usr/local/google/home/valentyn/tmp/py37env/lib/python3.7/site-packages/dill/_dill.py", > line 910 in save_module_dict > File >
[jira] [Work logged] (BEAM-7926) Visualize PCollection with Interactive Beam
[ https://issues.apache.org/jira/browse/BEAM-7926?focusedWorklogId=332841&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332841 ]

ASF GitHub Bot logged work on BEAM-7926:
Author: ASF GitHub Bot
Created on: 23/Oct/19 20:11
Start Date: 23/Oct/19 20:11
Worklog Time Spent: 10m

Work Description: KevinGG commented on pull request #9741: [BEAM-7926] Visualize PCollection
URL: https://github.com/apache/beam/pull/9741#discussion_r338254102

## File path: sdks/python/apache_beam/runners/interactive/display/pcoll_visualization_test.py ##

@@ -0,0 +1,152 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Tests for apache_beam.runners.interactive.display.pcoll_visualization."""
+from __future__ import absolute_import
+
+import sys
+import time
+import unittest
+
+import apache_beam as beam  # pylint: disable=ungrouped-imports
+import timeloop
+from apache_beam.runners import runner
+from apache_beam.runners.interactive import interactive_environment as ie
+from apache_beam.runners.interactive.display import pcoll_visualization as pv
+
+# Work around nose tests using Python2 without unittest.mock module.
+try:
+  from unittest.mock import patch
+except ImportError:
+  from mock import patch
+
+
+class PCollVisualizationTest(unittest.TestCase):
+
+  def setUp(self):
+    self._p = beam.Pipeline()
+    # pylint: disable=range-builtin-not-iterating
+    self._pcoll = self._p | 'Create' >> beam.Create(range(1000))
+
+  @unittest.skipIf(sys.version_info < (3, 5, 3),
+                   'PCollVisualization is not supported on Python 2.')

Review comment: Changing the error message.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Issue Time Tracking
---
Worklog Id: (was: 332841)
Time Spent: 13h 20m (was: 13h 10m)

> Visualize PCollection with Interactive Beam
> ---
>
> Key: BEAM-7926
> URL: https://issues.apache.org/jira/browse/BEAM-7926
> Project: Beam
> Issue Type: New Feature
> Components: runner-py-interactive
> Reporter: Ning Kang
> Assignee: Ning Kang
> Priority: Major
> Time Spent: 13h 20m
> Remaining Estimate: 0h
>
> Support auto plotting / charting of materialized data of a given PCollection
> with Interactive Beam.
> Say an Interactive Beam pipeline is defined as
>   p = create_pipeline()
>   pcoll = p | 'Transform' >> transform()
> The user can call a single function and get auto-magical charting of the data
> as materialized pcoll, e.g., visualize(pcoll).

-- This message was sent by Atlassian Jira (v8.3.4#803005)
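The try/except import at the top of the test file above is the usual Python 2/3 compatibility pattern for `mock`, and the `@patch(..., lambda ...)` decorators later in the file rely on `patch`'s ability to swap in a replacement object and restore the original afterwards. A minimal self-contained sketch of both ideas (the `json.dumps` target here is purely illustrative, not part of the Beam code):

```python
# Prefer the stdlib module (Python 3.3+); fall back to the external
# "mock" backport on Python 2, as the Beam test file above does.
try:
    from unittest.mock import patch
except ImportError:
    from mock import patch  # Python 2 backport (pip install mock)

import json

# patch(target, new) replaces the attribute for the duration of the
# block and restores it on exit -- the same mechanism the Beam test
# uses to replace PCollVisualization._to_element_list with a lambda.
with patch("json.dumps", lambda obj: "patched"):
    result_inside = json.dumps({"a": 1})

result_outside = json.dumps({"a": 1})

print(result_inside)   # -> patched
print(result_outside)  # -> {"a": 1}
```

Used as a decorator instead of a context manager, the patch stays active for the whole decorated test method, which is why the Beam tests stack several `@patch` lines above a single test.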
[jira] [Work logged] (BEAM-7926) Visualize PCollection with Interactive Beam
[ https://issues.apache.org/jira/browse/BEAM-7926?focusedWorklogId=332840&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332840 ]

ASF GitHub Bot logged work on BEAM-7926:
Author: ASF GitHub Bot
Created on: 23/Oct/19 20:10
Start Date: 23/Oct/19 20:10
Worklog Time Spent: 10m

Work Description: KevinGG commented on pull request #9741: [BEAM-7926] Visualize PCollection
URL: https://github.com/apache/beam/pull/9741#discussion_r338253651

## File path: sdks/python/apache_beam/runners/interactive/display/pcoll_visualization.py ##

(Note: the HTML templates below were mangled in transit; the `<script>`, `<link>` and facets element tags are reconstructed from context.)

@@ -0,0 +1,279 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Module visualizes PCollection data.
+
+For internal use only; no backwards-compatibility guarantees.
+Only works with Python 3.5+.
+"""
+from __future__ import absolute_import
+
+import base64
+import logging
+from datetime import timedelta
+
+from pandas.io.json import json_normalize
+
+from apache_beam import pvalue
+from apache_beam.runners.interactive import interactive_environment as ie
+from apache_beam.runners.interactive import pipeline_instrument as instr
+
+# jsons doesn't support < Python 3.5. Work around with json for legacy tests.
+# TODO(BEAM-8288): clean up once Py2 is deprecated from Beam.
+try:
+  import jsons
+  _pv_jsons_load = jsons.load
+  _pv_jsons_dump = jsons.dump
+except ImportError:
+  import json
+  _pv_jsons_load = json.load
+  _pv_jsons_dump = json.dump
+
+try:
+  from facets_overview.generic_feature_statistics_generator import GenericFeatureStatisticsGenerator
+  _facets_gfsg_ready = True
+except ImportError:
+  _facets_gfsg_ready = False
+
+try:
+  from IPython.core.display import HTML
+  from IPython.core.display import Javascript
+  from IPython.core.display import display
+  from IPython.core.display import display_javascript
+  from IPython.core.display import update_display
+  _ipython_ready = True
+except ImportError:
+  _ipython_ready = False
+
+try:
+  from timeloop import Timeloop
+  _tl_ready = True
+except ImportError:
+  _tl_ready = False
+
+# 1-d types that need additional normalization to be compatible with DataFrame.
+_one_dimension_types = (int, float, str, bool, list, tuple)
+
+_DIVE_SCRIPT_TEMPLATE = """
+document.querySelector("#{display_id}").data = {jsonstr};"""
+_DIVE_HTML_TEMPLATE = """
+<script src="https://cdnjs.cloudflare.com/ajax/libs/webcomponentsjs/1.3.3/webcomponents-lite.js"></script>
+<link rel="import" href="https://raw.githubusercontent.com/PAIR-code/facets/1.0.0/facets-dist/facets-jupyter.html">
+<facets-dive id="{display_id}"></facets-dive>
+<script>
+  document.querySelector("#{display_id}").data = {jsonstr};
+</script>"""
+_OVERVIEW_SCRIPT_TEMPLATE = """
+  document.querySelector("#{display_id}").protoInput = "{protostr}";
+  """
+_OVERVIEW_HTML_TEMPLATE = """
+<script src="https://cdnjs.cloudflare.com/ajax/libs/webcomponentsjs/1.3.3/webcomponents-lite.js"></script>
+<link rel="import" href="https://raw.githubusercontent.com/PAIR-code/facets/1.0.0/facets-dist/facets-jupyter.html">
+<facets-overview id="{display_id}"></facets-overview>
+<script>
+  document.querySelector("#{display_id}").protoInput = "{protostr}";
+</script>"""
+_DATAFRAME_PAGINATION_TEMPLATE = """
+<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.2.2/jquery.min.js"></script>
+<script src="https://cdn.datatables.net/1.10.16/js/jquery.dataTables.js"></script>
+<link rel="stylesheet" href="https://cdn.datatables.net/1.10.16/css/jquery.dataTables.css">
+{dataframe_html}
+<script>
+  $("#{table_id}").DataTable();
+</script>"""
+
+
+def visualize(pcoll,
+              dynamic_plotting_interval=None):
+  """Visualizes the data of a given PCollection. Optionally enables dynamic
+  plotting with interval in seconds if the PCollection is being produced by a
+  running pipeline or the pipeline is streaming indefinitely. The function
+  always returns immediately and is asynchronous when dynamic plotting is on.
+
+  If dynamic plotting enabled, the visualization is updated continuously until
+  the pipeline producing the PCollection is in an end state. The visualization
+  would be anchored to the notebook cell output
[jira] [Work logged] (BEAM-8432) Parametrize source & target compatibility for beam Java modules
[ https://issues.apache.org/jira/browse/BEAM-8432?focusedWorklogId=332838=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332838 ] ASF GitHub Bot logged work on BEAM-8432: Author: ASF GitHub Bot Created on: 23/Oct/19 20:09 Start Date: 23/Oct/19 20:09 Worklog Time Spent: 10m Work Description: lgajowy commented on issue #9830: [BEAM-8432] Move javaVersion to gradle.properties URL: https://github.com/apache/beam/pull/9830#issuecomment-545614858 Run Python PreCommit This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 332838) Time Spent: 20m (was: 10m) > Parametrize source & target compatibility for beam Java modules > --- > > Key: BEAM-8432 > URL: https://issues.apache.org/jira/browse/BEAM-8432 > Project: Beam > Issue Type: Improvement > Components: build-system >Reporter: Lukasz Gajowy >Assignee: Lukasz Gajowy >Priority: Major > Fix For: Not applicable > > Time Spent: 20m > Remaining Estimate: 0h > > Currently, "javaVersion" property is hardcoded in BeamModulePlugin in > [JavaNatureConfiguration|https://github.com/apache/beam/blob/master/buildSrc/src/main/groovy/org/apache/beam/gradle/BeamModulePlugin.groovy#L82]. > For the sake of migrating the project to Java 11 we could use a mechanism > that will allow parametrizing the version from the command line, e.g: > {code:java} > // this could set source and target compatibility to 11: > ./gradlew clean build -PjavaVersion=11{code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8432) Parametrize source & target compatibility for beam Java modules
[ https://issues.apache.org/jira/browse/BEAM-8432?focusedWorklogId=332839=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332839 ] ASF GitHub Bot logged work on BEAM-8432: Author: ASF GitHub Bot Created on: 23/Oct/19 20:09 Start Date: 23/Oct/19 20:09 Worklog Time Spent: 10m Work Description: lgajowy commented on issue #9830: [BEAM-8432] Move javaVersion to gradle.properties URL: https://github.com/apache/beam/pull/9830#issuecomment-545614924 Run CommunityMetrics PreCommit This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 332839) Time Spent: 0.5h (was: 20m) > Parametrize source & target compatibility for beam Java modules > --- > > Key: BEAM-8432 > URL: https://issues.apache.org/jira/browse/BEAM-8432 > Project: Beam > Issue Type: Improvement > Components: build-system >Reporter: Lukasz Gajowy >Assignee: Lukasz Gajowy >Priority: Major > Fix For: Not applicable > > Time Spent: 0.5h > Remaining Estimate: 0h > > Currently, "javaVersion" property is hardcoded in BeamModulePlugin in > [JavaNatureConfiguration|https://github.com/apache/beam/blob/master/buildSrc/src/main/groovy/org/apache/beam/gradle/BeamModulePlugin.groovy#L82]. > For the sake of migrating the project to Java 11 we could use a mechanism > that will allow parametrizing the version from the command line, e.g: > {code:java} > // this could set source and target compatibility to 11: > ./gradlew clean build -PjavaVersion=11{code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8446) apache_beam.io.gcp.bigquery_write_it_test.BigQueryWriteIntegrationTests.test_big_query_write_new_types is flaky
[ https://issues.apache.org/jira/browse/BEAM-8446?focusedWorklogId=332834=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332834 ] ASF GitHub Bot logged work on BEAM-8446: Author: ASF GitHub Bot Created on: 23/Oct/19 19:59 Start Date: 23/Oct/19 19:59 Worklog Time Spent: 10m Work Description: pabloem commented on issue #9855: [BEAM-8446] Retrying BQ query on timeouts URL: https://github.com/apache/beam/pull/9855#issuecomment-545611313 Run Python PreCommit This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 332834) Time Spent: 2h 40m (was: 2.5h) > apache_beam.io.gcp.bigquery_write_it_test.BigQueryWriteIntegrationTests.test_big_query_write_new_types > is flaky > --- > > Key: BEAM-8446 > URL: https://issues.apache.org/jira/browse/BEAM-8446 > Project: Beam > Issue Type: New Feature > Components: test-failures >Reporter: Boyuan Zhang >Assignee: Pablo Estrada >Priority: Major > Time Spent: 2h 40m > Remaining Estimate: 0h > > test_big_query_write_new_types appears to be flaky in > beam_PostCommit_Python37 test suite. > https://builds.apache.org/job/beam_PostCommit_Python37/733/ > https://builds.apache.org/job/beam_PostCommit_Python37/739/ -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8446) apache_beam.io.gcp.bigquery_write_it_test.BigQueryWriteIntegrationTests.test_big_query_write_new_types is flaky
[ https://issues.apache.org/jira/browse/BEAM-8446?focusedWorklogId=332833=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332833 ] ASF GitHub Bot logged work on BEAM-8446: Author: ASF GitHub Bot Created on: 23/Oct/19 19:59 Start Date: 23/Oct/19 19:59 Worklog Time Spent: 10m Work Description: pabloem commented on issue #9855: [BEAM-8446] Retrying BQ query on timeouts URL: https://github.com/apache/beam/pull/9855#issuecomment-545170744 First instance of job running: https://builds.apache.org/job/beam_PostCommit_Python37_PR/42/ Next: https://builds.apache.org/job/beam_PostCommit_Python37_PR/43/ Next: https://builds.apache.org/job/beam_PostCommit_Python37_PR/44/ - timeout without errors Next: https://builds.apache.org/job/beam_PostCommit_Python37_PR/45/ - timeout without errors This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 332833) Time Spent: 2.5h (was: 2h 20m) > apache_beam.io.gcp.bigquery_write_it_test.BigQueryWriteIntegrationTests.test_big_query_write_new_types > is flaky > --- > > Key: BEAM-8446 > URL: https://issues.apache.org/jira/browse/BEAM-8446 > Project: Beam > Issue Type: New Feature > Components: test-failures >Reporter: Boyuan Zhang >Assignee: Pablo Estrada >Priority: Major > Time Spent: 2.5h > Remaining Estimate: 0h > > test_big_query_write_new_types appears to be flaky in > beam_PostCommit_Python37 test suite. > https://builds.apache.org/job/beam_PostCommit_Python37/733/ > https://builds.apache.org/job/beam_PostCommit_Python37/739/ -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8446) apache_beam.io.gcp.bigquery_write_it_test.BigQueryWriteIntegrationTests.test_big_query_write_new_types is flaky
[ https://issues.apache.org/jira/browse/BEAM-8446?focusedWorklogId=332832=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332832 ] ASF GitHub Bot logged work on BEAM-8446: Author: ASF GitHub Bot Created on: 23/Oct/19 19:58 Start Date: 23/Oct/19 19:58 Worklog Time Spent: 10m Work Description: pabloem commented on issue #9855: [BEAM-8446] Retrying BQ query on timeouts URL: https://github.com/apache/beam/pull/9855#issuecomment-545611045 Run Python 3.7 PostCommit This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 332832) Time Spent: 2h 20m (was: 2h 10m) > apache_beam.io.gcp.bigquery_write_it_test.BigQueryWriteIntegrationTests.test_big_query_write_new_types > is flaky > --- > > Key: BEAM-8446 > URL: https://issues.apache.org/jira/browse/BEAM-8446 > Project: Beam > Issue Type: New Feature > Components: test-failures >Reporter: Boyuan Zhang >Assignee: Pablo Estrada >Priority: Major > Time Spent: 2h 20m > Remaining Estimate: 0h > > test_big_query_write_new_types appears to be flaky in > beam_PostCommit_Python37 test suite. > https://builds.apache.org/job/beam_PostCommit_Python37/733/ > https://builds.apache.org/job/beam_PostCommit_Python37/739/ -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-8446) apache_beam.io.gcp.bigquery_write_it_test.BigQueryWriteIntegrationTests.test_big_query_write_new_types is flaky
[ https://issues.apache.org/jira/browse/BEAM-8446?focusedWorklogId=332831=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332831 ] ASF GitHub Bot logged work on BEAM-8446: Author: ASF GitHub Bot Created on: 23/Oct/19 19:58 Start Date: 23/Oct/19 19:58 Worklog Time Spent: 10m Work Description: pabloem commented on issue #9855: [BEAM-8446] Retrying BQ query on timeouts URL: https://github.com/apache/beam/pull/9855#issuecomment-545611021 Again timeout without errors. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 332831) Time Spent: 2h 10m (was: 2h) > apache_beam.io.gcp.bigquery_write_it_test.BigQueryWriteIntegrationTests.test_big_query_write_new_types > is flaky > --- > > Key: BEAM-8446 > URL: https://issues.apache.org/jira/browse/BEAM-8446 > Project: Beam > Issue Type: New Feature > Components: test-failures >Reporter: Boyuan Zhang >Assignee: Pablo Estrada >Priority: Major > Time Spent: 2h 10m > Remaining Estimate: 0h > > test_big_query_write_new_types appears to be flaky in > beam_PostCommit_Python37 test suite. > https://builds.apache.org/job/beam_PostCommit_Python37/733/ > https://builds.apache.org/job/beam_PostCommit_Python37/739/ -- This message was sent by Atlassian Jira (v8.3.4#803005)
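The PR under review ("Retrying BQ query on timeouts") retries the BigQuery verification query when it times out; the actual Beam change lives in #9855. As a generic illustration of the retry-on-timeout idea only — every name below is hypothetical and none of it is Beam or BigQuery API:

```python
import time

def retry_on_timeout(fn, tries=3, delay=0.0):
    """Call fn(); on TimeoutError retry up to `tries` attempts, then re-raise."""
    for attempt in range(1, tries + 1):
        try:
            return fn()
        except TimeoutError:
            if attempt == tries:
                raise  # out of attempts: surface the original error
            time.sleep(delay)  # optionally back off before retrying

calls = []

def flaky_query():
    # Hypothetical stand-in for a BigQuery query that times out twice
    # before succeeding, like the flaky integration test above.
    calls.append(1)
    if len(calls) < 3:
        raise TimeoutError("query timed out")
    return ["row1", "row2"]

result = retry_on_timeout(flaky_query, tries=3)
print(result, len(calls))  # -> ['row1', 'row2'] 3
```

A real implementation would typically add exponential backoff and a cap on total elapsed time so a persistently failing query still fails the test rather than hanging it.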
[jira] [Work logged] (BEAM-7926) Visualize PCollection with Interactive Beam
[ https://issues.apache.org/jira/browse/BEAM-7926?focusedWorklogId=332830&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332830 ]

ASF GitHub Bot logged work on BEAM-7926:
Author: ASF GitHub Bot
Created on: 23/Oct/19 19:46
Start Date: 23/Oct/19 19:46
Worklog Time Spent: 10m

Work Description: KevinGG commented on pull request #9741: [BEAM-7926] Visualize PCollection
URL: https://github.com/apache/beam/pull/9741#discussion_r338243645

## File path: sdks/python/setup.py ##

@@ -166,6 +166,14 @@ def get_version():
     'google-cloud-bigtable>=0.31.1,<1.1.0',
 ]

+INTERACTIVE_BEAM = [
+    'facets-overview>=1.0.0,<2',

Review comment: They are already sorted. Do you want me to also sort GCP_REQUIREMENTS?

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Issue Time Tracking
---
Worklog Id: (was: 332830)
Time Spent: 13h (was: 12h 50m)

> Visualize PCollection with Interactive Beam
> ---
>
> Key: BEAM-7926
> URL: https://issues.apache.org/jira/browse/BEAM-7926
> Project: Beam
> Issue Type: New Feature
> Components: runner-py-interactive
> Reporter: Ning Kang
> Assignee: Ning Kang
> Priority: Major
> Time Spent: 13h
> Remaining Estimate: 0h
>
> Support auto plotting / charting of materialized data of a given PCollection
> with Interactive Beam.
> Say an Interactive Beam pipeline is defined as
>   p = create_pipeline()
>   pcoll = p | 'Transform' >> transform()
> The user can call a single function and get auto-magical charting of the data
> as materialized pcoll, e.g., visualize(pcoll).

-- This message was sent by Atlassian Jira (v8.3.4#803005)
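The dynamic plotting described in this issue returns immediately with a handle backed by a background loop (a `timeloop.Timeloop` in the PR's tests) that keeps refreshing the display until the handle is stopped. A stdlib-only sketch of that handle pattern — `PeriodicUpdater` and its members are illustrative names, not the Beam API:

```python
import threading
import time

class PeriodicUpdater:
    """Hypothetical stand-in for a dynamic-plotting handle: calls `fn`
    every `interval` seconds on a daemon thread until stop() is called."""

    def __init__(self, fn, interval):
        self._fn = fn
        self._interval = interval
        self._stopped = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()  # constructor returns immediately

    def _run(self):
        # Event.wait returns False on timeout (refresh again) and True
        # once stop() sets the event (exit the loop).
        while not self._stopped.wait(self._interval):
            self._fn()

    def stop(self):
        self._stopped.set()
        self._thread.join()

ticks = []
handle = PeriodicUpdater(lambda: ticks.append(1), interval=0.01)
time.sleep(0.1)  # let a few refreshes happen in the background
handle.stop()
print(len(ticks) > 0)  # -> True
```

This mirrors the asynchronous contract in the issue: the call returns at once, the refreshes happen in the background, and callers stop the loop via the returned handle, as `test_dynamic_plotting_return_handle` does with `h.stop()`.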
[jira] [Commented] (BEAM-8397) DataflowRunnerTest.test_remote_runner_display_data fails due to infinite recursion during pickling.
[ https://issues.apache.org/jira/browse/BEAM-8397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16958195#comment-16958195 ]

Valentyn Tymofieiev commented on BEAM-8397:
---

For some reason, extracting the inner subclasses SpecialDoFn and SpecialParDo out of test_remote_runner_display_data makes the test fail, because 1 out of 3 expected display_data entries does not appear:

{noformat}
Traceback (most recent call last):
  File "/usr/local/google/home/valentyn/projects/beam/beam/beam/sdks/python/apache_beam/runners/dataflow/dataflow_runner_test.py", line 290, in test_remote_runner_display_data
    self.assertEqual(len(disp_data), 3)
AssertionError: 2 != 3
{noformat}

Examination of the output shows that the TIMESTAMP display data entry is missing, while the INT and STRING ones are present. Sounds like another issue...

> DataflowRunnerTest.test_remote_runner_display_data fails due to infinite
> recursion during pickling.
> ---
>
> Key: BEAM-8397
> URL: https://issues.apache.org/jira/browse/BEAM-8397
> Project: Beam
> Issue Type: Sub-task
> Components: sdk-py-core
> Reporter: Valentyn Tymofieiev
> Assignee: Valentyn Tymofieiev
> Priority: Major
>
> `python ./setup.py test -s
> apache_beam.runners.dataflow.dataflow_runner_test.DataflowRunnerTest.test_remote_runner_display_data`
> passes.
> `tox -e py37-gcp` passes if Beam depends on dill==0.3.0, but fails if Beam
> depends on dill==0.3.1.1.
> `python ./setup.py nosetests --tests
> 'apache_beam/runners/dataflow/dataflow_runner_test.py:DataflowRunnerTest.test_remote_runner_display_data`
> fails currently if run on master.
> The failure indicates infinite recursion during pickling:
> {noformat}
> test_remote_runner_display_data
> (apache_beam.runners.dataflow.dataflow_runner_test.DataflowRunnerTest) ...
> Fatal Python error: Cannot recover from stack overflow.
> Current thread 0x7f9d700ed740 (most recent call first): > File "/usr/lib/python3.7/pickle.py", line 479 in get > File "/usr/lib/python3.7/pickle.py", line 497 in save > File "/usr/lib/python3.7/pickle.py", line 786 in save_tuple > File "/usr/lib/python3.7/pickle.py", line 504 in save > File "/usr/lib/python3.7/pickle.py", line 638 in save_reduce > File > "/usr/local/google/home/valentyn/tmp/py37env/lib/python3.7/site-packages/dill/_dill.py", > line 1394 in save_function > File "/usr/lib/python3.7/pickle.py", line 504 in save > File "/usr/lib/python3.7/pickle.py", line 882 in _batch_setitems > File "/usr/lib/python3.7/pickle.py", line 856 in save_dict > File > "/usr/local/google/home/valentyn/tmp/py37env/lib/python3.7/site-packages/dill/_dill.py", > line 910 in save_module_dict > File > "/usr/local/google/home/valentyn/projects/beam/clean/beam/sdks/python/apache_beam/internal/pickler.py", > line 198 in new_save_module_dict > File "/usr/lib/python3.7/pickle.py", line 504 in save > File "/usr/lib/python3.7/pickle.py", line 786 in save_tuple > File "/usr/lib/python3.7/pickle.py", line 504 in save > File "/usr/lib/python3.7/pickle.py", line 638 in save_reduce > File > "/usr/local/google/home/valentyn/projects/beam/clean/beam/sdks/python/apache_beam/internal/pickler.py", > line 114 in wrapper > File "/usr/lib/python3.7/pickle.py", line 504 in save > File "/usr/lib/python3.7/pickle.py", line 771 in save_tuple > File "/usr/lib/python3.7/pickle.py", line 504 in save > File "/usr/lib/python3.7/pickle.py", line 638 in save_reduce > File > "/usr/local/google/home/valentyn/tmp/py37env/lib/python3.7/site-packages/dill/_dill.py", > line 1137 in save_cell > File "/usr/lib/python3.7/pickle.py", line 504 in save > File "/usr/lib/python3.7/pickle.py", line 771 in save_tuple > File "/usr/lib/python3.7/pickle.py", line 504 in save > File "/usr/lib/python3.7/pickle.py", line 786 in save_tuple > File "/usr/lib/python3.7/pickle.py", line 504 in save > File 
"/usr/lib/python3.7/pickle.py", line 638 in save_reduce > File > "/usr/local/google/home/valentyn/tmp/py37env/lib/python3.7/site-packages/dill/_dill.py", > line 1394 in save_function > File "/usr/lib/python3.7/pickle.py", line 504 in save > File "/usr/lib/python3.7/pickle.py", line 882 in _batch_setitems > File "/usr/lib/python3.7/pickle.py", line 856 in save_dict > File > "/usr/local/google/home/valentyn/tmp/py37env/lib/python3.7/site-packages/dill/_dill.py", > line 910 in save_module_dict > File > "/usr/local/google/home/valentyn/projects/beam/clean/beam/sdks/python/apache_beam/internal/pickler.py", > line 198 in new_save_module_dict > ... > {noformat} > cc: [~yoshiki.obata] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (BEAM-8397) DataflowRunnerTest.test_remote_runner_display_data fails due to infinite recursion during pickling.
[ https://issues.apache.org/jira/browse/BEAM-8397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958195#comment-16958195 ] Valentyn Tymofieiev edited comment on BEAM-8397 at 10/23/19 7:44 PM: - For some reason, after extracting inner subclasses SpecialDoFn, SpecialParDo out of test_remote_runner_display_data makes the test fail, because 1 out of 3 expected display_data entries does not appear: {noformat} Traceback (most recent call last): File "/usr/local/google/home/valentyn/projects/beam/beam/beam/sdks/python/apache_beam/runners/dataflow/dataflow_runner_test.py", line 290, in test_remote_runner_display_data self.assertEqual(len(disp_data), 3) AssertionError: 2 != 3 {noformat} Examination of output shows that TIMESTAMP display data entry is missing, while INT and STRING ones are present. Sounds like another issue... was (Author: tvalentyn): For some reason, after extracting inner subclasses SpecialDoFn, SpecialParDo out of test_remote_runner_display_data makes the test fail, because 1 out of 3 expected display_data entries does not appear: {noformat} Traceback (most recent call last): File "/usr/local/google/home/valentyn/projects/beam/beam/beam/sdks/python/apache_beam/runners/dataflow/dataflow_runner_test.py", line 290, in test_remote_runner_display_data self.assertEqual(len(disp_data), 3) AssertionError: 2 != 3 {noformat} Examination of output shows that TIMESTAMP display data entry is missing, while INT and STRING ones are present. Sounds like another issue... > DataflowRunnerTest.test_remote_runner_display_data fails due to infinite > recursion during pickling. > --- > > Key: BEAM-8397 > URL: https://issues.apache.org/jira/browse/BEAM-8397 > Project: Beam > Issue Type: Sub-task > Components: sdk-py-core >Reporter: Valentyn Tymofieiev >Assignee: Valentyn Tymofieiev >Priority: Major > > `python ./setup.py test -s > apache_beam.runners.dataflow.dataflow_runner_test.DataflowRunnerTest.test_remote_runner_display_data` > passes. 
> `tox -e py37-gcp` passes if Beam depends on dill==0.3.0, but fails if Beam > depends on dill==0.3.1.1.`python ./setup.py nosetests --tests > 'apache_beam/runners/dataflow/dataflow_runner_test.py:DataflowRunnerTest.test_remote_runner_display_data` > fails currently if run on master. > The failure indicates infinite recursion during pickling: > {noformat} > test_remote_runner_display_data > (apache_beam.runners.dataflow.dataflow_runner_test.DataflowRunnerTest) ... > Fatal Python error: Cannot recover from stack overflow. > Current thread 0x7f9d700ed740 (most recent call first): > File "/usr/lib/python3.7/pickle.py", line 479 in get > File "/usr/lib/python3.7/pickle.py", line 497 in save > File "/usr/lib/python3.7/pickle.py", line 786 in save_tuple > File "/usr/lib/python3.7/pickle.py", line 504 in save > File "/usr/lib/python3.7/pickle.py", line 638 in save_reduce > File > "/usr/local/google/home/valentyn/tmp/py37env/lib/python3.7/site-packages/dill/_dill.py", > line 1394 in save_function > File "/usr/lib/python3.7/pickle.py", line 504 in save > File "/usr/lib/python3.7/pickle.py", line 882 in _batch_setitems > File "/usr/lib/python3.7/pickle.py", line 856 in save_dict > File > "/usr/local/google/home/valentyn/tmp/py37env/lib/python3.7/site-packages/dill/_dill.py", > line 910 in save_module_dict > File > "/usr/local/google/home/valentyn/projects/beam/clean/beam/sdks/python/apache_beam/internal/pickler.py", > line 198 in new_save_module_dict > File "/usr/lib/python3.7/pickle.py", line 504 in save > File "/usr/lib/python3.7/pickle.py", line 786 in save_tuple > File "/usr/lib/python3.7/pickle.py", line 504 in save > File "/usr/lib/python3.7/pickle.py", line 638 in save_reduce > File > "/usr/local/google/home/valentyn/projects/beam/clean/beam/sdks/python/apache_beam/internal/pickler.py", > line 114 in wrapper > File "/usr/lib/python3.7/pickle.py", line 504 in save > File "/usr/lib/python3.7/pickle.py", line 771 in save_tuple > File "/usr/lib/python3.7/pickle.py", line 
504 in save > File "/usr/lib/python3.7/pickle.py", line 638 in save_reduce > File > "/usr/local/google/home/valentyn/tmp/py37env/lib/python3.7/site-packages/dill/_dill.py", > line 1137 in save_cell > File "/usr/lib/python3.7/pickle.py", line 504 in save > File "/usr/lib/python3.7/pickle.py", line 771 in save_tuple > File "/usr/lib/python3.7/pickle.py", line 504 in save > File "/usr/lib/python3.7/pickle.py", line 786 in save_tuple > File "/usr/lib/python3.7/pickle.py", line 504 in save > File "/usr/lib/python3.7/pickle.py", line 638 in save_reduce > File >
[jira] [Work logged] (BEAM-7926) Visualize PCollection with Interactive Beam
[ https://issues.apache.org/jira/browse/BEAM-7926?focusedWorklogId=332826=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332826 ] ASF GitHub Bot logged work on BEAM-7926: Author: ASF GitHub Bot Created on: 23/Oct/19 19:37 Start Date: 23/Oct/19 19:37 Worklog Time Spent: 10m Work Description: KevinGG commented on pull request #9741: [BEAM-7926] Visualize PCollection URL: https://github.com/apache/beam/pull/9741#discussion_r338240149 ## File path: sdks/python/apache_beam/runners/interactive/display/pcoll_visualization_test.py ## @@ -0,0 +1,152 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +"""Tests for apache_beam.runners.interactive.display.pcoll_visualization.""" +from __future__ import absolute_import + +import sys +import time +import unittest + +import apache_beam as beam # pylint: disable=ungrouped-imports +import timeloop +from apache_beam.runners import runner +from apache_beam.runners.interactive import interactive_environment as ie +from apache_beam.runners.interactive.display import pcoll_visualization as pv + +# Work around nose tests using Python2 without unittest.mock module. 
+try: + from unittest.mock import patch +except ImportError: + from mock import patch + + +class PCollVisualizationTest(unittest.TestCase): + + def setUp(self): +self._p = beam.Pipeline() +# pylint: disable=range-builtin-not-iterating +self._pcoll = self._p | 'Create' >> beam.Create(range(1000)) + + @unittest.skipIf(sys.version_info < (3, 5, 3), + 'PCollVisualization is not supported on Python 2.') + def test_raise_error_for_non_pcoll_input(self): +class Foo(object): + pass + +with self.assertRaises(AssertionError) as ctx: + pv.PCollVisualization(Foo()) + self.assertTrue('pcoll should be apache_beam.pvalue.PCollection' in + ctx.exception) + + @unittest.skipIf(sys.version_info < (3, 5, 3), + 'PCollVisualization is not supported on Python 2.') + def test_pcoll_visualization_generate_unique_display_id(self): +pv_1 = pv.PCollVisualization(self._pcoll) +pv_2 = pv.PCollVisualization(self._pcoll) +self.assertNotEqual(pv_1._dive_display_id, pv_2._dive_display_id) +self.assertNotEqual(pv_1._overview_display_id, pv_2._overview_display_id) +self.assertNotEqual(pv_1._df_display_id, pv_2._df_display_id) + + @unittest.skipIf(sys.version_info < (3, 5, 3), + 'PCollVisualization is not supported on Python 2.') + @patch('apache_beam.runners.interactive.display.pcoll_visualization' + '.PCollVisualization._to_element_list', lambda x: [1, 2, 3]) + def test_one_shot_visualization_not_return_handle(self): +self.assertIsNone(pv.visualize(self._pcoll)) + + def _mock_to_element_list(self): +yield [1, 2, 3] +yield [1, 2, 3, 4] +yield [1, 2, 3, 4, 5] +yield [1, 2, 3, 4, 5, 6] +yield [1, 2, 3, 4, 5, 6, 7] +yield [1, 2, 3, 4, 5, 6, 7, 8] + + @unittest.skipIf(sys.version_info < (3, 5, 3), + 'PCollVisualization is not supported on Python 2.') + @patch('apache_beam.runners.interactive.display.pcoll_visualization' + '.PCollVisualization._to_element_list', _mock_to_element_list) + def test_dynamic_plotting_return_handle(self): +h = pv.visualize(self._pcoll, dynamic_plotting_interval=1) 
+self.assertIsInstance(h, timeloop.Timeloop) +h.stop() + + @unittest.skipIf(sys.version_info < (3, 5, 3), + 'PCollVisualization is not supported on Python 2.') + @patch('apache_beam.runners.interactive.display.pcoll_visualization' + '.PCollVisualization._to_element_list', _mock_to_element_list) + @patch('apache_beam.runners.interactive.display.pcoll_visualization' + '.PCollVisualization.display_facets') + def test_dynamic_plotting_update_same_display(self, +mocked_display_facets): +fake_pipeline_result = runner.PipelineResult(runner.PipelineState.RUNNING) +ie.current_env().set_pipeline_result(self._p, fake_pipeline_result) +# Starts async dynamic plotting that never
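The tests above stub `PCollVisualization._to_element_list` with `@patch` so that no real pipeline has to run. The same technique in a self-contained form (the `Renderer` class and its `fetch` method are hypothetical stand-ins, not Beam APIs):

```python
from unittest.mock import patch


class Renderer(object):
  """Hypothetical stand-in for a class whose data access we want to stub."""

  def fetch(self):
    raise RuntimeError('would hit a real pipeline')

  def render(self):
    return list(self.fetch())


# Passing an explicit replacement (`new`) means patch does not inject a mock
# argument; every Renderer created inside the patched scope sees the lambda,
# just as the Beam test patches the class via its dotted path.
@patch.object(Renderer, 'fetch', lambda self: [1, 2, 3])
def show():
  return Renderer().render()
```

Because the patch targets the class attribute, it also covers instances constructed after the patch is applied, which is what lets the Beam tests create `PCollVisualization` objects freely inside the test body.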
[jira] [Work logged] (BEAM-7926) Visualize PCollection with Interactive Beam
[ https://issues.apache.org/jira/browse/BEAM-7926?focusedWorklogId=332802=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332802 ] ASF GitHub Bot logged work on BEAM-7926: Author: ASF GitHub Bot Created on: 23/Oct/19 18:57 Start Date: 23/Oct/19 18:57 Worklog Time Spent: 10m Work Description: KevinGG commented on pull request #9741: [BEAM-7926] Visualize PCollection URL: https://github.com/apache/beam/pull/9741#discussion_r338223025 ## File path: sdks/python/apache_beam/runners/interactive/display/pcoll_visualization.py ## @@ -0,0 +1,279 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +"""Module visualizes PCollection data. + +For internal use only; no backwards-compatibility guarantees. +Only works with Python 3.5+. +""" +from __future__ import absolute_import + +import base64 +import logging +from datetime import timedelta + +from pandas.io.json import json_normalize + +from apache_beam import pvalue +from apache_beam.runners.interactive import interactive_environment as ie +from apache_beam.runners.interactive import pipeline_instrument as instr + +# jsons doesn't support < Python 3.5. Work around with json for legacy tests. +# TODO(BEAM-8288): clean up once Py2 is deprecated from Beam. 
+try:
+  import jsons
+  _pv_jsons_load = jsons.load
+  _pv_jsons_dump = jsons.dump
+except ImportError:
+  import json
+  _pv_jsons_load = json.load
+  _pv_jsons_dump = json.dump
+
+try:
+  from facets_overview.generic_feature_statistics_generator import GenericFeatureStatisticsGenerator
+  _facets_gfsg_ready = True
+except ImportError:
+  _facets_gfsg_ready = False
+
+try:
+  from IPython.core.display import HTML
+  from IPython.core.display import Javascript
+  from IPython.core.display import display
+  from IPython.core.display import display_javascript
+  from IPython.core.display import update_display
+  _ipython_ready = True
+except ImportError:
+  _ipython_ready = False
+
+try:
+  from timeloop import Timeloop
+  _tl_ready = True
+except ImportError:
+  _tl_ready = False
+
+# 1-d types that need additional normalization to be compatible with DataFrame.
+_one_dimension_types = (int, float, str, bool, list, tuple)
+
+_DIVE_SCRIPT_TEMPLATE = """
+document.querySelector("#{display_id}").data = {jsonstr};"""
+_DIVE_HTML_TEMPLATE = """
+<script src="https://cdnjs.cloudflare.com/ajax/libs/webcomponentsjs/1.3.3/webcomponents-lite.js"></script>
+<link rel="import" href="https://raw.githubusercontent.com/PAIR-code/facets/1.0.0/facets-dist/facets-jupyter.html">
+<facets-dive id="{display_id}"></facets-dive>
+<script>
+ document.querySelector("#{display_id}").data = {jsonstr};
+</script>"""
+_OVERVIEW_SCRIPT_TEMPLATE = """
+ document.querySelector("#{display_id}").protoInput = "{protostr}";
+ """
+_OVERVIEW_HTML_TEMPLATE = """
+<script src="https://cdnjs.cloudflare.com/ajax/libs/webcomponentsjs/1.3.3/webcomponents-lite.js"></script>
+<link rel="import" href="https://raw.githubusercontent.com/PAIR-code/facets/1.0.0/facets-dist/facets-jupyter.html">
+<facets-overview id="{display_id}"></facets-overview>
+<script>
+ document.querySelector("#{display_id}").protoInput = "{protostr}";
+</script>"""
+_DATAFRAME_PAGINATION_TEMPLATE = """
+<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.2.2/jquery.min.js"></script>
+<script src="https://cdn.datatables.net/1.10.16/js/jquery.dataTables.js"></script>
+<link rel="stylesheet" href="https://cdn.datatables.net/1.10.16/css/jquery.dataTables.css">
+{dataframe_html}
+<script>
+ $("#{table_id}").DataTable();
+</script>"""
+
+
+def visualize(pcoll,
dynamic_plotting_interval=None): + """Visualizes the data of a given PCollection. Optionally enables dynamic + plotting at an interval in seconds if the PCollection is being produced by a + running pipeline or the pipeline is streaming indefinitely. The function + always returns immediately and is asynchronous when dynamic plotting is on. + + If dynamic plotting is enabled, the visualization is updated continuously until + the pipeline producing the PCollection is in an end state. The visualization + is anchored to the notebook cell output
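The visualization module quoted above guards each optional dependency (`jsons`, `facets_overview`, `IPython`, `timeloop`) with a try/except ImportError and a readiness flag, so missing packages make the feature a NOOP instead of an import crash. The pattern in a generic, runnable form (the helper name `optional_import` and the bogus module name are illustrative, not Beam code):

```python
import importlib


def optional_import(name):
  """Return (module, ready_flag); mirrors the try/except ImportError guards."""
  try:
    return importlib.import_module(name), True
  except ImportError:
    return None, False


# 'json' ships with the stdlib, so its flag is True; the second name is a
# deliberately nonexistent module, so its flag is False and callers that
# check the flag should NOOP rather than fail.
json_mod, json_ready = optional_import('json')
facets_mod, facets_ready = optional_import('no_such_facets_package')
```

Call sites then branch on the flag (`if facets_ready: ...`) exactly as the module does with `_facets_gfsg_ready` and `_ipython_ready`.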
[jira] [Work logged] (BEAM-7926) Visualize PCollection with Interactive Beam
[ https://issues.apache.org/jira/browse/BEAM-7926?focusedWorklogId=332801=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332801 ] ASF GitHub Bot logged work on BEAM-7926: Author: ASF GitHub Bot Created on: 23/Oct/19 18:56 Start Date: 23/Oct/19 18:56 Worklog Time Spent: 10m Work Description: KevinGG commented on pull request #9741: [BEAM-7926] Visualize PCollection URL: https://github.com/apache/beam/pull/9741#discussion_r338222750 ## File path: sdks/python/setup.py ## @@ -219,11 +227,15 @@ def run(self): install_requires=REQUIRED_PACKAGES, python_requires=python_requires, test_suite='nose.collector', -tests_require=REQUIRED_TEST_PACKAGES, +tests_require=[ +REQUIRED_TEST_PACKAGES, +INTERACTIVE_BEAM, Review comment: Yes, I agree! This would avoid running those tests when the Python version is under 3.5.3. I guess the only task that would still require [interactive] for older Python versions is "docs", where it scans the source code. I'll just add the try-import statements everywhere to work around it. Issue Time Tracking --- Worklog Id: (was: 332801) Time Spent: 12.5h (was: 12h 20m) > Visualize PCollection with Interactive Beam > --- > > Key: BEAM-7926 > URL: https://issues.apache.org/jira/browse/BEAM-7926 > Project: Beam > Issue Type: New Feature > Components: runner-py-interactive >Reporter: Ning Kang >Assignee: Ning Kang >Priority: Major > Time Spent: 12.5h > Remaining Estimate: 0h > > Support auto plotting / charting of materialized data of a given PCollection > with Interactive Beam. > Say an Interactive Beam pipeline defined as > p = create_pipeline() > pcoll = p | 'Transform' >> transform() > The user can call a single function and get auto-magical charting of the data > as materialized pcoll.
> e.g., visualize(pcoll)
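The approach this thread converges on, gating interactive-only tests on the interpreter version so unsupported environments report a skip instead of an import failure, can be sketched as a small runnable example. The class and test names are illustrative, not Beam code:

```python
import io
import sys
import unittest


class InteractiveFeatureTest(unittest.TestCase):

  @unittest.skipIf(sys.version_info < (3, 5, 3),
                   'Interactive visualization needs Python 3.5.3+.')
  def test_feature(self):
    # Body only executes on supported interpreters; elsewhere the runner
    # reports a skip rather than failing on a missing dependency.
    self.assertTrue(True)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(InteractiveFeatureTest)
# A skipped test still counts as a successful run, which is exactly the
# behavior wanted for the Python 2 / pre-3.5.3 test jobs.
result = unittest.TextTestRunner(stream=io.StringIO()).run(suite)
```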
[jira] [Work logged] (BEAM-7926) Visualize PCollection with Interactive Beam
[ https://issues.apache.org/jira/browse/BEAM-7926?focusedWorklogId=332798=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332798 ] ASF GitHub Bot logged work on BEAM-7926: Author: ASF GitHub Bot Created on: 23/Oct/19 18:48 Start Date: 23/Oct/19 18:48 Worklog Time Spent: 10m Work Description: KevinGG commented on pull request #9741: [BEAM-7926] Visualize PCollection URL: https://github.com/apache/beam/pull/9741#discussion_r338218993 ## File path: sdks/python/apache_beam/runners/interactive/interactive_environment_test.py ## @@ -88,6 +91,72 @@ def test_watch_class_instance(self): self.assertVariableWatched('_var_in_class_instance', self._var_in_class_instance) + def test_fail_to_set_pipeline_result_key_not_pipeline(self): +class NotPipeline(object): + pass + +with self.assertRaises(AssertionError) as ctx: + ie.current_env().set_pipeline_result(NotPipeline(), + runner.PipelineResult( + runner.PipelineState.RUNNING)) + self.assertTrue('pipeline must be an instance of apache_beam.Pipeline ' + 'or its subclass' in ctx.exception) + + def test_fail_to_set_pipeline_result_value_not_pipeline_result(self): +class NotResult(object): + pass + +with self.assertRaises(AssertionError) as ctx: + ie.current_env().set_pipeline_result(self._p, NotResult()) + self.assertTrue('result must be an instance of ' + 'apache_beam.runners.runner.PipelineResult or its ' + 'subclass' in ctx.exception) + + def test_set_pipeline_result_successfully(self): +class PipelineSubClass(beam.Pipeline): + pass + +class PipelineResultSubClass(runner.PipelineResult): + pass + +pipeline = PipelineSubClass() +pipeline_result = PipelineResultSubClass(runner.PipelineState.RUNNING) +ie.current_env().set_pipeline_result(pipeline, pipeline_result) +self.assertIs(ie.current_env().pipeline_result(pipeline), pipeline_result) + + def test_determine_terminal_state(self): +for state in (runner.PipelineState.DONE, + runner.PipelineState.FAILED, + runner.PipelineState.CANCELLED, + 
runner.PipelineState.UPDATED, + runner.PipelineState.DRAINED): + ie.current_env().set_pipeline_result(self._p, runner.PipelineResult( Review comment: Great, thanks! I totally missed it. I'll change it to use that class method. Issue Time Tracking --- Worklog Id: (was: 332798) Time Spent: 12h 20m (was: 12h 10m)
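The class method the reviewer plans to use is presumably a terminal-state helper along the lines of `PipelineState.is_terminal`, which centralizes the check instead of re-listing the five terminal states at every call site. A self-contained mirror of the idea, not the real Beam class:

```python
class PipelineState(object):
  """Self-contained mirror of the states exercised in the test above."""
  RUNNING = 'RUNNING'
  DONE = 'DONE'
  FAILED = 'FAILED'
  CANCELLED = 'CANCELLED'
  UPDATED = 'UPDATED'
  DRAINED = 'DRAINED'

  _TERMINAL = frozenset([DONE, FAILED, CANCELLED, UPDATED, DRAINED])

  @classmethod
  def is_terminal(cls, state):
    # One shared predicate: tests and dynamic-plotting loops both ask the
    # same question, "can this pipeline still produce new data?"
    return state in cls._TERMINAL
```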
[jira] [Work logged] (BEAM-7926) Visualize PCollection with Interactive Beam
[ https://issues.apache.org/jira/browse/BEAM-7926?focusedWorklogId=332797=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332797 ] ASF GitHub Bot logged work on BEAM-7926: Author: ASF GitHub Bot Created on: 23/Oct/19 18:47 Start Date: 23/Oct/19 18:47 Worklog Time Spent: 10m Work Description: KevinGG commented on pull request #9741: [BEAM-7926] Visualize PCollection URL: https://github.com/apache/beam/pull/9741#discussion_r338218476 ## File path: sdks/python/apache_beam/runners/interactive/interactive_environment.py ## @@ -61,8 +66,20 @@ def __init__(self, cache_manager=None): self._watching_set = set() # Holds variables list of (Dict[str, object]). self._watching_dict_list = [] +# Holds results of pipeline runs as Dict[Pipeline, PipelineResult]. +# Each key is a pipeline instance defined by the end user. The +# InteractiveRunner is responsible for populating this dictionary +# implicitly. +self._pipeline_results = {} # Always watch __main__ module. self.watch('__main__') +# Do a warning level logging if current python version is below 2 or 3.5.3. +if sys.version_info < (3,): + logging.warning('Interactive Beam does not support Python 2.') +elif sys.version_info < (3, 5, 3): Review comment: We can. I'll make it simpler. Let's just consider every different component a big feature. The feature requires Py 3.5.3. And we NOOP if dependency is not ready. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
Issue Time Tracking --- Worklog Id: (was: 332797) Time Spent: 12h 10m (was: 12h)
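The warning logic in the diff above boils down to two interpreter-version comparisons against `sys.version_info`. A runnable sketch, with a hypothetical `check_interactive_support` helper that also returns the message so callers can inspect it (the real code simply logs inline):

```python
import logging
import sys


def check_interactive_support(version_info=None):
  """Warn, rather than raise, when interactive features are unsupported.

  Tuple comparison does the work: (2, 7, 16) < (3,) is True, while
  (3, 5, 0) < (3,) is False but (3, 5, 0) < (3, 5, 3) is True.
  """
  v = sys.version_info if version_info is None else version_info
  if v < (3,):
    msg = 'Interactive Beam does not support Python 2.'
  elif v < (3, 5, 3):
    msg = 'Interactive Beam requires Python 3.5.3 or above.'
  else:
    return None
  logging.warning(msg)
  return msg
```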
[jira] [Commented] (BEAM-8368) [Python] libprotobuf-generated exception when importing apache_beam
[ https://issues.apache.org/jira/browse/BEAM-8368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958159#comment-16958159 ] Brian Hulette commented on BEAM-8368: - Beam 2.17 cut date is today, should we keep this as a blocker and potentially cherry-pick in the next couple of days if there's an arrow release to use? Or just punt until 2.18? > [Python] libprotobuf-generated exception when importing apache_beam > --- > > Key: BEAM-8368 > URL: https://issues.apache.org/jira/browse/BEAM-8368 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Affects Versions: 2.15.0, 2.17.0 >Reporter: Ubaier Bhat >Assignee: Brian Hulette >Priority: Blocker > Fix For: 2.17.0 > > Attachments: error_log.txt > > Time Spent: 1h 50m > Remaining Estimate: 0h > > Unable to import apache_beam after upgrading to macos 10.15 (Catalina). > Cleared all the pipenvs and but can't get it working again. > {code} > import apache_beam as beam > /Users/***/.local/share/virtualenvs/beam-etl-ims6DitU/lib/python3.7/site-packages/apache_beam/__init__.py:84: > UserWarning: Some syntactic constructs of Python 3 are not yet fully > supported by Apache Beam. > 'Some syntactic constructs of Python 3 are not yet fully supported by ' > [libprotobuf ERROR google/protobuf/descriptor_database.cc:58] File already > exists in database: > [libprotobuf FATAL google/protobuf/descriptor.cc:1370] CHECK failed: > GeneratedDatabase()->Add(encoded_file_descriptor, size): > libc++abi.dylib: terminating with uncaught exception of type > google::protobuf::FatalException: CHECK failed: > GeneratedDatabase()->Add(encoded_file_descriptor, size): > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-4775) JobService should support returning metrics
[ https://issues.apache.org/jira/browse/BEAM-4775?focusedWorklogId=332795=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332795 ] ASF GitHub Bot logged work on BEAM-4775: Author: ASF GitHub Bot Created on: 23/Oct/19 18:46 Start Date: 23/Oct/19 18:46 Worklog Time Spent: 10m Work Description: pabloem commented on issue #9843: [BEAM-4775] Converting MonitoringInfos to MetricResults in PortableRunner URL: https://github.com/apache/beam/pull/9843#issuecomment-545583207 We'll have to rely on r:@Ardagan as Alex is not working on this anymore. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 332795) Time Spent: 50h 20m (was: 50h 10m) > JobService should support returning metrics > --- > > Key: BEAM-4775 > URL: https://issues.apache.org/jira/browse/BEAM-4775 > Project: Beam > Issue Type: Bug > Components: beam-model >Reporter: Eugene Kirpichov >Assignee: Kamil Wasilewski >Priority: Major > Time Spent: 50h 20m > Remaining Estimate: 0h > > Design doc: [https://s.apache.org/get-metrics-api]. > Further discussion is ongoing on [this > doc|https://docs.google.com/document/d/1m83TsFvJbOlcLfXVXprQm1B7vUakhbLZMzuRrOHWnTg/edit?ts=5c826bb4#heading=h.faqan9rjc6dm]. > We want to report job metrics back to the portability harness from the runner > harness, for displaying to users. > h1. Relevant PRs in flight: > h2. Ready for Review: > * [#8022|https://github.com/apache/beam/pull/8022]: correct the Job RPC > protos from [#8018|https://github.com/apache/beam/pull/8018]. > h2. 
Iterating / Discussing: > * [#7971|https://github.com/apache/beam/pull/7971]: Flink portable metrics: > get ptransform from MonitoringInfo, not stage name > ** this is a simpler, Flink-specific PR that is basically duplicated inside > each of the following two, so may be worth trying to merge in first > * #[7915|https://github.com/apache/beam/pull/7915]: use MonitoringInfo data > model in Java SDK metrics > * [#7868|https://github.com/apache/beam/pull/7868]: MonitoringInfo URN tweaks > h2. Merged > * [#8018|https://github.com/apache/beam/pull/8018]: add job metrics RPC > protos > * [#7867|https://github.com/apache/beam/pull/7867]: key MetricResult by a > MetricKey > * [#7938|https://github.com/apache/beam/pull/7938]: move MonitoringInfo > protos to model/pipeline module > * [#7883|https://github.com/apache/beam/pull/7883]: Add > MetricQueryResults.allMetrics() helper > * [#7866|https://github.com/apache/beam/pull/7866]: move function helpers > from fn-harness to sdks/java/core > * [#7890|https://github.com/apache/beam/pull/7890]: consolidate MetricResult > implementations > h2. Closed > * [#7934|https://github.com/apache/beam/pull/7934]: job metrics RPC + SDK > support > * [#7876|https://github.com/apache/beam/pull/7876]: Clean up metric protos; > support integer distributions, gauges -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=332793=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332793 ] ASF GitHub Bot logged work on BEAM-1440: Author: ASF GitHub Bot Created on: 23/Oct/19 18:45 Start Date: 23/Oct/19 18:45 Worklog Time Spent: 10m Work Description: pabloem commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python URL: https://github.com/apache/beam/pull/9772#discussion_r337775537 ## File path: sdks/python/apache_beam/io/gcp/bigquery_read_it_test.py ## @@ -86,23 +117,61 @@ def create_table(self, tablename): table_schema.fields.append(table_field) table = bigquery.Table( tableReference=bigquery.TableReference( -projectId=self.project, -datasetId=self.dataset_id, -tableId=tablename), +projectId=cls.project, +datasetId=cls.dataset_id, +tableId=table_name), schema=table_schema) request = bigquery.BigqueryTablesInsertRequest( -projectId=self.project, datasetId=self.dataset_id, table=table) -self.bigquery_client.client.tables.Insert(request) +projectId=cls.project, datasetId=cls.dataset_id, table=table) +cls.bigquery_client.client.tables.Insert(request) table_data = [ {'number': 1, 'str': 'abc'}, {'number': 2, 'str': 'def'}, {'number': 3, 'str': u'你好'}, {'number': 4, 'str': u'привет'} ] -self.bigquery_client.insert_rows( -self.project, self.dataset_id, tablename, table_data) +cls.bigquery_client.insert_rows( +cls.project, cls.dataset_id, table_name, table_data) - def create_table_new_types(self, table_name): + def get_expected_data(self): Review comment: Make this a constant, and use it to create the table, and to match in the asserts. ``` TABLE_DATA = [ {'number': 1, 'str': 'abc'}, {'number': 2, 'str': 'def'}, {'number': 3, 'str': u'你好'}, {'number': 4, 'str': u'привет'} ] ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 332793) > Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK > -- > > Key: BEAM-1440 > URL: https://issues.apache.org/jira/browse/BEAM-1440 > Project: Beam > Issue Type: New Feature > Components: sdk-py-core >Reporter: Chamikara Madhusanka Jayalath >Assignee: Kamil Wasilewski >Priority: Major > Time Spent: 4h 40m > Remaining Estimate: 0h > > Currently we have a BigQuery native source for Python SDK [1]. > This can only be used by Dataflow runner. > We should implement a Beam BigQuery source that implements > iobase.BoundedSource [2] interface so that other runners that try to use > Python SDK can read from BigQuery as well. Java SDK already has a Beam > BigQuery source [3]. > [1] > https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py > [2] > https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/iobase.py#L70 > [3] > https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L1189 -- This message was sent by Atlassian Jira (v8.3.4#803005)
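The suggestion in the review, sketched: hoist the fixture rows into one module-level constant used both when populating the table and when asserting on pipeline output, so the two copies can never drift apart. The helper functions below stand in for the test's real BigQuery calls:

```python
# Single source of truth for the integration-test fixture rows.
TABLE_DATA = [
    {'number': 1, 'str': 'abc'},
    {'number': 2, 'str': 'def'},
    {'number': 3, 'str': u'你好'},
    {'number': 4, 'str': u'привет'},
]


def insert_rows(fake_table):
  # Stand-in for bigquery_client.insert_rows in the test's setup.
  fake_table.extend(TABLE_DATA)


def expected_data():
  # Stand-in for the get_expected_data helper the reviewer wants removed;
  # it now just returns the shared constant.
  return TABLE_DATA


table = []
insert_rows(table)
```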
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=332789=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332789 ] ASF GitHub Bot logged work on BEAM-1440: Author: ASF GitHub Bot Created on: 23/Oct/19 18:45 Start Date: 23/Oct/19 18:45 Worklog Time Spent: 10m Work Description: pabloem commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python URL: https://github.com/apache/beam/pull/9772#discussion_r337778293 ## File path: sdks/python/apache_beam/io/gcp/bigquery_read_it_test.py ## @@ -86,23 +117,61 @@ def create_table(self, tablename): table_schema.fields.append(table_field) table = bigquery.Table( tableReference=bigquery.TableReference( -projectId=self.project, -datasetId=self.dataset_id, -tableId=tablename), +projectId=cls.project, +datasetId=cls.dataset_id, +tableId=table_name), schema=table_schema) request = bigquery.BigqueryTablesInsertRequest( -projectId=self.project, datasetId=self.dataset_id, table=table) -self.bigquery_client.client.tables.Insert(request) +projectId=cls.project, datasetId=cls.dataset_id, table=table) +cls.bigquery_client.client.tables.Insert(request) table_data = [ {'number': 1, 'str': 'abc'}, {'number': 2, 'str': 'def'}, {'number': 3, 'str': u'你好'}, {'number': 4, 'str': u'привет'} ] -self.bigquery_client.insert_rows( -self.project, self.dataset_id, tablename, table_data) +cls.bigquery_client.insert_rows( +cls.project, cls.dataset_id, table_name, table_data) - def create_table_new_types(self, table_name): + def get_expected_data(self): +return [ +{'number': 1, 'str': 'abc'}, +{'number': 2, 'str': 'def'}, +{'number': 3, 'str': u'你好'}, +{'number': 4, 'str': u'привет'} +] + + @skip(['PortableRunner', 'FlinkRunner']) + @attr('IT') + def test_native_source(self): +with beam.Pipeline(argv=self.args) as p: + result = (p | 'read' >> beam.io.Read(beam.io.BigQuerySource( + query=self.query, use_standard_sql=True))) + assert_that(result, 
equal_to(self.get_expected_data())) + + @skip(['DirectRunner', 'TestDirectRunner']) + @attr('IT') + def test_iobase_source(self): +with beam.Pipeline(argv=self.args) as p: + result = (p | 'read' >> beam.io.ReadFromBigQuery( + query=self.query, use_standard_sql=True, project=self.project, + gcs_bucket_name='gs://temp-storage-for-end-to-end-tests')) Review comment: Why are we skipping this for DirectRunner? This should work there, right? `gcs_bucket_name` may need to be passed testpipeline arguments, in case it runs in a project that does not have access to that bucket (we run it internally at Google). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 332789) Time Spent: 4.5h (was: 4h 20m) > Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK > -- > > Key: BEAM-1440 > URL: https://issues.apache.org/jira/browse/BEAM-1440 > Project: Beam > Issue Type: New Feature > Components: sdk-py-core >Reporter: Chamikara Madhusanka Jayalath >Assignee: Kamil Wasilewski >Priority: Major > Time Spent: 4.5h > Remaining Estimate: 0h > > Currently we have a BigQuery native source for Python SDK [1]. > This can only be used by Dataflow runner. > We should implement a Beam BigQuery source that implements > iobase.BoundedSource [2] interface so that other runners that try to use > Python SDK can read from BigQuery as well. Java SDK already has a Beam > BigQuery source [3]. 
> [1] > https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py > [2] > https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/iobase.py#L70 > [3] > https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L1189 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=332790=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332790 ] ASF GitHub Bot logged work on BEAM-1440: Author: ASF GitHub Bot Created on: 23/Oct/19 18:45 Start Date: 23/Oct/19 18:45 Worklog Time Spent: 10m Work Description: pabloem commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python URL: https://github.com/apache/beam/pull/9772#discussion_r337762067 ## File path: sdks/python/apache_beam/io/gcp/bigquery_tools.py ## @@ -370,7 +383,37 @@ def _start_query_job(self, project_id, query, use_legacy_sql, flatten_results, jobReference=reference)) response = self.client.jobs.Insert(request) -return response.jobReference.jobId, response.jobReference.location +return response + + def wait_for_bq_job(self, job_reference, sleep_duration_sec=5, + max_retries=60): +"""Poll job until it is DONE. + +Args: + job_reference: bigquery.JobReference instance. + sleep_duration_sec: Specifies the delay in seconds between retries. + max_retries: The total number of times to retry. If equals to 0, +the function waits forever. + +Raises: + `RuntimeError`: If the job is FAILED or the number of retries has been +reached. +""" +retry = 0 +while True: + retry += 1 + job = self.get_job(job_reference.projectId, job_reference.jobId, + job_reference.location) + logging.info('Job status: %s', job.status.state) + if job.status.state == 'DONE' and job.status.errorResult: +raise RuntimeError("BigQuery job %s failed. Error Result: %s", Review comment: Python errors don't replace strings that way. You'll have to do ``` raise RuntimeError("BigQuery job %s failed. Error Result: %s" % (job_reference.jobId, job.status.errorResult)) ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
Issue Time Tracking --- Worklog Id: (was: 332790) Time Spent: 4.5h (was: 4h 20m)
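The formatting point in the review comment is easy to demonstrate: exception constructors store extra positional arguments in `args` instead of interpolating them into the message, so the `%`-formatting has to happen before the exception is built. The job id and error text here are made up:

```python
# Wrong: RuntimeError performs no %-formatting; the template and the values
# all just become elements of the exception's args tuple.
wrong = RuntimeError(
    'BigQuery job %s failed. Error Result: %s', 'job-123', 'boom')

# Right: format the message first, then construct the exception.
right = RuntimeError(
    'BigQuery job %s failed. Error Result: %s' % ('job-123', 'boom'))
```

(`logging` calls are the exception to this habit: they do defer `%`-substitution, which is likely why the mistake is common.)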
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=332792=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332792 ] ASF GitHub Bot logged work on BEAM-1440: Author: ASF GitHub Bot Created on: 23/Oct/19 18:45 Start Date: 23/Oct/19 18:45 Worklog Time Spent: 10m Work Description: pabloem commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python URL: https://github.com/apache/beam/pull/9772#discussion_r337779603 ## File path: sdks/python/apache_beam/io/gcp/bigquery.py ## @@ -496,6 +505,233 @@ def reader(self, test_bigquery_client=None): kms_key=self.kms_key) +SchemaFields = collections.namedtuple('SchemaFields', 'fields mode name type') + + +def _to_bool(value): + return value == 'true' + + +def _to_decimal(value): + return decimal.Decimal(value) + + +def _to_bytes(value): + """Converts value from str to bytes on Python 3.x. Does nothing on + Python 2.7.""" + return value.encode('utf-8') + + +class _BigQueryRowCoder(coders.Coder): + """A coder for a table row (represented as a dict) from a JSON string which + applies additional conversions. + """ + + def __init__(self, table_schema): +# bigquery.TableSchema type is unpickable so we must translate it to a +# pickable type +self.fields = [SchemaFields(x.fields, x.mode, x.name, x.type) + for x in table_schema.fields] +self._converters = { +'INTEGER': int, +'INT64': int, +'FLOAT': float, +'BOOLEAN': _to_bool, +'NUMERIC': _to_decimal, +'BYTES': _to_bytes, +} + + def decode(self, value): +value = json.loads(value) +for field in self.fields: + if field.name not in value: +# The field exists in the schema, but it doesn't exist in this row. 
+# It probably means its value was null, as the extract to JSON job +# doesn't preserve null fields +value[field.name] = None +continue + + try: +converter = self._converters[field.type] +value[field.name] = converter(value[field.name]) + except KeyError: +# No need to do any conversion +pass +return value + + def is_deterministic(self): +return True + + def to_type_hint(self): +return dict + + +class _BigQuerySource(BoundedSource): + """Read data from BigQuery. + +This source uses a BigQuery export job to take a snapshot of the table +on GCS, and then reads from each produced JSON file. + +Do note that currently this source does not work with DirectRunner. + + Args: +table (str, callable, ValueProvider): The ID of the table, or a callable + that returns it. The ID must contain only letters ``a-z``, ``A-Z``, + numbers ``0-9``, or underscores ``_``. If dataset argument is + :data:`None` then the table argument must contain the entire table + reference specified as: ``'DATASET.TABLE'`` + or ``'PROJECT:DATASET.TABLE'``. If it's a callable, it must receive one + argument representing an element to be written to BigQuery, and return + a TableReference, or a string table name as specified above. +dataset (str): The ID of the dataset containing this table or + :data:`None` if the table reference is specified entirely by the table + argument. +project (str): The ID of the project containing this table. +query (str): A query to be used instead of arguments table, dataset, and + project. + validate (bool): If :data:`True`, various checks will be done when source + gets initialized (e.g., is table present?). This should be + :data:`True` for most scenarios in order to catch errors as early as + possible (pipeline construction instead of pipeline execution). It + should be :data:`False` if the table is created during pipeline + execution by a previous step. +coder (~apache_beam.coders.coders.Coder): The coder for the table + rows. 
+      If :data:`None`, then the default coder is
+      :class:`~apache_beam.io.gcp.bigquery._BigQueryRowCoder`,
+      which will interpret every line in a file as a JSON serialized
+      dictionary. This argument needs a value only in special cases when
+      returning table rows as dictionaries is not desirable.
+    use_standard_sql (bool): Specifies whether to use BigQuery's standard SQL
+      dialect for this query. The default value is :data:`False`.
+      If set to :data:`True`, the query will use BigQuery's updated SQL
+      dialect with improved standards compliance.
+      This parameter is ignored for table inputs.
+    flatten_results (bool): Flattens all nested and repeated fields in the
+      query results. The default value is :data:`True`.
+    kms_key (str): Experimental. Optional Cloud KMS key name for use when
+      creating new tables.
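The `_BigQueryRowCoder` in the diff above decodes each JSON-exported line and converts fields by their BigQuery type, treating schema fields absent from a row as null. A condensed, self-contained version of that conversion logic (the `decode_row` helper and the two-tuple schema are simplifications for illustration, not the PR's actual API):

```python
import decimal
import json

# Converters keyed by BigQuery type name: the JSON export represents
# numbers and booleans as strings, so they need an explicit pass.
_CONVERTERS = {
    'INTEGER': int,
    'INT64': int,
    'FLOAT': float,
    'BOOLEAN': lambda v: v == 'true',
    'NUMERIC': decimal.Decimal,
}

def decode_row(json_line, schema):
    """Decode one exported JSON line into a dict.

    `schema` is a list of (name, bq_type) pairs. Fields missing from
    the row are treated as null, since the export-to-JSON job drops
    null fields rather than writing them.
    """
    row = json.loads(json_line)
    for name, bq_type in schema:
        if name not in row:
            row[name] = None
            continue
        converter = _CONVERTERS.get(bq_type)
        if converter is not None:
            row[name] = converter(row[name])
    return row
```

Types with no converter (e.g. STRING) pass through unchanged, matching the `except KeyError: pass` branch in the diff.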
[jira] [Work logged] (BEAM-4775) JobService should support returning metrics
[ https://issues.apache.org/jira/browse/BEAM-4775?focusedWorklogId=332794=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332794 ] ASF GitHub Bot logged work on BEAM-4775: Author: ASF GitHub Bot Created on: 23/Oct/19 18:45 Start Date: 23/Oct/19 18:45 Worklog Time Spent: 10m Work Description: pabloem commented on issue #9843: [BEAM-4775] Converting MonitoringInfos to MetricResults in PortableRunner URL: https://github.com/apache/beam/pull/9843#issuecomment-545583207 We'll have to rely on r:@Ardagan This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 332794) Time Spent: 50h 10m (was: 50h) > JobService should support returning metrics > --- > > Key: BEAM-4775 > URL: https://issues.apache.org/jira/browse/BEAM-4775 > Project: Beam > Issue Type: Bug > Components: beam-model >Reporter: Eugene Kirpichov >Assignee: Kamil Wasilewski >Priority: Major > Time Spent: 50h 10m > Remaining Estimate: 0h > > Design doc: [https://s.apache.org/get-metrics-api]. > Further discussion is ongoing on [this > doc|https://docs.google.com/document/d/1m83TsFvJbOlcLfXVXprQm1B7vUakhbLZMzuRrOHWnTg/edit?ts=5c826bb4#heading=h.faqan9rjc6dm]. > We want to report job metrics back to the portability harness from the runner > harness, for displaying to users. > h1. Relevant PRs in flight: > h2. Ready for Review: > * [#8022|https://github.com/apache/beam/pull/8022]: correct the Job RPC > protos from [#8018|https://github.com/apache/beam/pull/8018]. > h2. 
Iterating / Discussing: > * [#7971|https://github.com/apache/beam/pull/7971]: Flink portable metrics: > get ptransform from MonitoringInfo, not stage name > ** this is a simpler, Flink-specific PR that is basically duplicated inside > each of the following two, so may be worth trying to merge in first > * #[7915|https://github.com/apache/beam/pull/7915]: use MonitoringInfo data > model in Java SDK metrics > * [#7868|https://github.com/apache/beam/pull/7868]: MonitoringInfo URN tweaks > h2. Merged > * [#8018|https://github.com/apache/beam/pull/8018]: add job metrics RPC > protos > * [#7867|https://github.com/apache/beam/pull/7867]: key MetricResult by a > MetricKey > * [#7938|https://github.com/apache/beam/pull/7938]: move MonitoringInfo > protos to model/pipeline module > * [#7883|https://github.com/apache/beam/pull/7883]: Add > MetricQueryResults.allMetrics() helper > * [#7866|https://github.com/apache/beam/pull/7866]: move function helpers > from fn-harness to sdks/java/core > * [#7890|https://github.com/apache/beam/pull/7890]: consolidate MetricResult > implementations > h2. Closed > * [#7934|https://github.com/apache/beam/pull/7934]: job metrics RPC + SDK > support > * [#7876|https://github.com/apache/beam/pull/7876]: Clean up metric protos; > support integer distributions, gauges -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=332791=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332791 ] ASF GitHub Bot logged work on BEAM-1440: Author: ASF GitHub Bot Created on: 23/Oct/19 18:45 Start Date: 23/Oct/19 18:45 Worklog Time Spent: 10m Work Description: pabloem commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python URL: https://github.com/apache/beam/pull/9772#discussion_r337764174 ## File path: sdks/python/apache_beam/io/gcp/bigquery_tools.py ## @@ -695,10 +769,12 @@ def get_or_create_table( def run_query(self, project_id, query, use_legacy_sql, flatten_results, dry_run=False): -job_id, location = self._start_query_job(project_id, query, - use_legacy_sql, flatten_results, - job_id=uuid.uuid4().hex, - dry_run=dry_run) +job = self._start_query_job(project_id, query, use_legacy_sql, Review comment: Did this API change? Did it return a job_id, location and now it returns the whole job object? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 332791) Time Spent: 4.5h (was: 4h 20m) > Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK > -- > > Key: BEAM-1440 > URL: https://issues.apache.org/jira/browse/BEAM-1440 > Project: Beam > Issue Type: New Feature > Components: sdk-py-core >Reporter: Chamikara Madhusanka Jayalath >Assignee: Kamil Wasilewski >Priority: Major > Time Spent: 4.5h > Remaining Estimate: 0h > > Currently we have a BigQuery native source for Python SDK [1]. > This can only be used by Dataflow runner. 
> We should implement a Beam BigQuery source that implements > iobase.BoundedSource [2] interface so that other runners that try to use > Python SDK can read from BigQuery as well. Java SDK already has a Beam > BigQuery source [3]. > [1] > https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py > [2] > https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/iobase.py#L70 > [3] > https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L1189 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-1440) Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1440?focusedWorklogId=332788=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332788 ] ASF GitHub Bot logged work on BEAM-1440: Author: ASF GitHub Bot Created on: 23/Oct/19 18:45 Start Date: 23/Oct/19 18:45 Worklog Time Spent: 10m Work Description: pabloem commented on pull request #9772: [BEAM-1440] Create a BigQuery source that implements iobase.BoundedSource for Python URL: https://github.com/apache/beam/pull/9772#discussion_r337760626 ## File path: sdks/python/test-suites/portable/py37/build.gradle ## @@ -30,3 +33,25 @@ task preCommitPy37() { dependsOn portableWordCountBatch dependsOn portableWordCountStreaming } + +task postCommitIT { + dependsOn 'installGcpTest' + dependsOn 'setupVirtualenv' + dependsOn ':runners:flink:1.8:job-server:shadowJar' + + doLast { +def tests = [ +"apache_beam.io.gcp.bigquery_read_it_test", +] +def testOpts = ["--tests=${tests.join(',')}"] +def cmdArgs = mapToArgString([ +"test_opts": testOpts, +"suite": "postCommitIT-flink-py37", +"pipeline_opts": "--runner=FlinkRunner --project=apache-beam-testing --environment_type=LOOPBACK", Review comment: Have you ran this test with Flink? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 332788) Time Spent: 4h 20m (was: 4h 10m) > Create a BigQuery source (that implements iobase.BoundedSource) for Python SDK > -- > > Key: BEAM-1440 > URL: https://issues.apache.org/jira/browse/BEAM-1440 > Project: Beam > Issue Type: New Feature > Components: sdk-py-core >Reporter: Chamikara Madhusanka Jayalath >Assignee: Kamil Wasilewski >Priority: Major > Time Spent: 4h 20m > Remaining Estimate: 0h > > Currently we have a BigQuery native source for Python SDK [1]. 
> This can only be used by Dataflow runner. > We should implement a Beam BigQuery source that implements > iobase.BoundedSource [2] interface so that other runners that try to use > Python SDK can read from BigQuery as well. Java SDK already has a Beam > BigQuery source [3]. > [1] > https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py > [2] > https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/iobase.py#L70 > [3] > https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L1189 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-7926) Visualize PCollection with Interactive Beam
[ https://issues.apache.org/jira/browse/BEAM-7926?focusedWorklogId=332787=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332787 ] ASF GitHub Bot logged work on BEAM-7926: Author: ASF GitHub Bot Created on: 23/Oct/19 18:44 Start Date: 23/Oct/19 18:44 Worklog Time Spent: 10m Work Description: KevinGG commented on pull request #9741: [BEAM-7926] Visualize PCollection URL: https://github.com/apache/beam/pull/9741#discussion_r338217168 ## File path: sdks/python/apache_beam/runners/interactive/display/pcoll_visualization_test.py ## @@ -0,0 +1,152 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +"""Tests for apache_beam.runners.interactive.display.pcoll_visualization.""" +from __future__ import absolute_import + +import sys +import time +import unittest + +import apache_beam as beam # pylint: disable=ungrouped-imports +import timeloop +from apache_beam.runners import runner +from apache_beam.runners.interactive import interactive_environment as ie +from apache_beam.runners.interactive.display import pcoll_visualization as pv + +# Work around nose tests using Python2 without unittest.mock module. 
+try: + from unittest.mock import patch +except ImportError: + from mock import patch + + +class PCollVisualizationTest(unittest.TestCase): + + def setUp(self): +self._p = beam.Pipeline() +# pylint: disable=range-builtin-not-iterating +self._pcoll = self._p | 'Create' >> beam.Create(range(1000)) + + @unittest.skipIf(sys.version_info < (3, 5, 3), + 'PCollVisualization is not supported on Python 2.') + def test_raise_error_for_non_pcoll_input(self): +class Foo(object): + pass + +with self.assertRaises(AssertionError) as ctx: + pv.PCollVisualization(Foo()) + self.assertTrue('pcoll should be apache_beam.pvalue.PCollection' in + ctx.exception) + + @unittest.skipIf(sys.version_info < (3, 5, 3), + 'PCollVisualization is not supported on Python 2.') + def test_pcoll_visualization_generate_unique_display_id(self): +pv_1 = pv.PCollVisualization(self._pcoll) +pv_2 = pv.PCollVisualization(self._pcoll) +self.assertNotEqual(pv_1._dive_display_id, pv_2._dive_display_id) +self.assertNotEqual(pv_1._overview_display_id, pv_2._overview_display_id) +self.assertNotEqual(pv_1._df_display_id, pv_2._df_display_id) + + @unittest.skipIf(sys.version_info < (3, 5, 3), + 'PCollVisualization is not supported on Python 2.') + @patch('apache_beam.runners.interactive.display.pcoll_visualization' + '.PCollVisualization._to_element_list', lambda x: [1, 2, 3]) + def test_one_shot_visualization_not_return_handle(self): +self.assertIsNone(pv.visualize(self._pcoll)) + + def _mock_to_element_list(self): +yield [1, 2, 3] +yield [1, 2, 3, 4] +yield [1, 2, 3, 4, 5] +yield [1, 2, 3, 4, 5, 6] +yield [1, 2, 3, 4, 5, 6, 7] +yield [1, 2, 3, 4, 5, 6, 7, 8] + + @unittest.skipIf(sys.version_info < (3, 5, 3), + 'PCollVisualization is not supported on Python 2.') + @patch('apache_beam.runners.interactive.display.pcoll_visualization' + '.PCollVisualization._to_element_list', _mock_to_element_list) + def test_dynamic_plotting_return_handle(self): +h = pv.visualize(self._pcoll, dynamic_plotting_interval=1) 
+self.assertIsInstance(h, timeloop.Timeloop) +h.stop() + + @unittest.skipIf(sys.version_info < (3, 5, 3), + 'PCollVisualization is not supported on Python 2.') + @patch('apache_beam.runners.interactive.display.pcoll_visualization' + '.PCollVisualization._to_element_list', _mock_to_element_list) + @patch('apache_beam.runners.interactive.display.pcoll_visualization' + '.PCollVisualization.display_facets') + def test_dynamic_plotting_update_same_display(self, +mocked_display_facets): +fake_pipeline_result = runner.PipelineResult(runner.PipelineState.RUNNING) +ie.current_env().set_pipeline_result(self._p, fake_pipeline_result) +# Starts async dynamic plotting that never
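The tests above use `unittest.mock.patch` decorators to swap an instance method (`_to_element_list`) for a stub — a lambda or a generator function — for the duration of a single test. The core pattern, in isolation (the `Visualizer` class here is a made-up stand-in):

```python
from unittest.mock import patch

class Visualizer:
    def elements(self):
        # Imagine this pulled live data from a running pipeline.
        return [1, 2, 3]

def total(v):
    return sum(v.elements())

# patch.object replaces Visualizer.elements only inside the with-block,
# the same substitution the @patch decorators perform per test method.
with patch.object(Visualizer, 'elements', lambda self: [10, 20]):
    patched_total = total(Visualizer())

# The original method is restored as soon as the block exits.
unpatched_total = total(Visualizer())
```

Patching at the class level, as both the sketch and the test file do, affects every instance created while the patch is active.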
[jira] [Work logged] (BEAM-7926) Visualize PCollection with Interactive Beam
[ https://issues.apache.org/jira/browse/BEAM-7926?focusedWorklogId=332777=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332777 ] ASF GitHub Bot logged work on BEAM-7926: Author: ASF GitHub Bot Created on: 23/Oct/19 18:37 Start Date: 23/Oct/19 18:37 Worklog Time Spent: 10m Work Description: KevinGG commented on pull request #9741: [BEAM-7926] Visualize PCollection URL: https://github.com/apache/beam/pull/9741#discussion_r338213864 ## File path: sdks/python/apache_beam/runners/interactive/display/pcoll_visualization.py ## @@ -0,0 +1,279 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +"""Module visualizes PCollection data. + +For internal use only; no backwards-compatibility guarantees. +Only works with Python 3.5+. +""" +from __future__ import absolute_import + +import base64 +import logging +from datetime import timedelta + +from pandas.io.json import json_normalize + +from apache_beam import pvalue +from apache_beam.runners.interactive import interactive_environment as ie +from apache_beam.runners.interactive import pipeline_instrument as instr + +# jsons doesn't support < Python 3.5. Work around with json for legacy tests. +# TODO(BEAM-8288): clean up once Py2 is deprecated from Beam. 
+try: + import jsons + _pv_jsons_load = jsons.load + _pv_jsons_dump = jsons.dump +except ImportError: + import json + _pv_jsons_load = json.load + _pv_jsons_dump = json.dump + +try: + from facets_overview.generic_feature_statistics_generator import GenericFeatureStatisticsGenerator + _facets_gfsg_ready = True +except ImportError: + _facets_gfsg_ready = False + +try: + from IPython.core.display import HTML + from IPython.core.display import Javascript + from IPython.core.display import display + from IPython.core.display import display_javascript + from IPython.core.display import update_display + _ipython_ready = True +except ImportError: + _ipython_ready = False + +try: + from timeloop import Timeloop + _tl_ready = True +except ImportError: + _tl_ready = False + +# 1-d types that need additional normalization to be compatible with DataFrame. +_one_dimension_types = (int, float, str, bool, list, tuple) + +_DIVE_SCRIPT_TEMPLATE = """ +document.querySelector("#{display_id}").data = {jsonstr};""" +_DIVE_HTML_TEMPLATE = """ +https://cdnjs.cloudflare.com/ajax/libs/webcomponentsjs/1.3.3/webcomponents-lite.js"> +https://raw.githubusercontent.com/PAIR-code/facets/1.0.0/facets-dist/facets-jupyter.html;> + + + document.querySelector("#{display_id}").data = {jsonstr}; +""" +_OVERVIEW_SCRIPT_TEMPLATE = """ + document.querySelector("#{display_id}").protoInput = "{protostr}"; + """ +_OVERVIEW_HTML_TEMPLATE = """ +https://cdnjs.cloudflare.com/ajax/libs/webcomponentsjs/1.3.3/webcomponents-lite.js"> +https://raw.githubusercontent.com/PAIR-code/facets/1.0.0/facets-dist/facets-jupyter.html;> + + + document.querySelector("#{display_id}").protoInput = "{protostr}"; +""" +_DATAFRAME_PAGINATION_TEMPLATE = """ +https://ajax.googleapis.com/ajax/libs/jquery/2.2.2/jquery.min.js"> +https://cdn.datatables.net/1.10.16/js/jquery.dataTables.js"> +https://cdn.datatables.net/1.10.16/css/jquery.dataTables.css;> +{dataframe_html} + + $("#{table_id}").DataTable(); +""" + + +def visualize(pcoll, 
dynamic_plotting_interval=None): + """Visualizes the data of a given PCollection. Optionally enables dynamic + plotting with interval in seconds if the PCollection is being produced by a + running pipeline or the pipeline is streaming indefinitely. The function + always returns immediately and is asynchronous when dynamic plotting is on. + + If dynamic plotting enabled, the visualization is updated continuously until + the pipeline producing the PCollection is in an end state. The visualization + would be anchored to the notebook cell output
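The `visualize` function described above returns immediately and keeps refreshing the display on a timer; the PR implements this with the `timeloop` package. A stdlib-only analogue of that fire-and-forget update loop, swapped to `threading` purely for illustration:

```python
import threading

def start_periodic(fn, interval, stop_event):
    """Call fn every `interval` seconds until stop_event is set.

    A stand-in for the Timeloop-based dynamic plotting loop: the caller
    gets control back immediately, and stops the updates by setting the
    event, much like calling Timeloop.stop() on the returned handle.
    """
    def loop():
        # Event.wait returns False on timeout (tick) and True once set.
        while not stop_event.wait(interval):
            fn()
    t = threading.Thread(target=loop, daemon=True)
    t.start()
    return t
```

Returning the thread as a handle mirrors `visualize` returning the Timeloop object only when dynamic plotting is enabled.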
[jira] [Work logged] (BEAM-7926) Visualize PCollection with Interactive Beam
[ https://issues.apache.org/jira/browse/BEAM-7926?focusedWorklogId=332775=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332775 ] ASF GitHub Bot logged work on BEAM-7926: Author: ASF GitHub Bot Created on: 23/Oct/19 18:32 Start Date: 23/Oct/19 18:32 Worklog Time Spent: 10m Work Description: KevinGG commented on pull request #9741: [BEAM-7926] Visualize PCollection URL: https://github.com/apache/beam/pull/9741#discussion_r338210890 ## File path: sdks/python/apache_beam/runners/interactive/display/pcoll_visualization.py ## @@ -0,0 +1,279 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +"""Module visualizes PCollection data. + +For internal use only; no backwards-compatibility guarantees. +Only works with Python 3.5+. +""" +from __future__ import absolute_import + +import base64 +import logging +from datetime import timedelta + +from pandas.io.json import json_normalize + +from apache_beam import pvalue +from apache_beam.runners.interactive import interactive_environment as ie +from apache_beam.runners.interactive import pipeline_instrument as instr + +# jsons doesn't support < Python 3.5. Work around with json for legacy tests. +# TODO(BEAM-8288): clean up once Py2 is deprecated from Beam. 
+try: + import jsons + _pv_jsons_load = jsons.load + _pv_jsons_dump = jsons.dump +except ImportError: + import json + _pv_jsons_load = json.load + _pv_jsons_dump = json.dump + +try: + from facets_overview.generic_feature_statistics_generator import GenericFeatureStatisticsGenerator + _facets_gfsg_ready = True +except ImportError: + _facets_gfsg_ready = False + +try: + from IPython.core.display import HTML + from IPython.core.display import Javascript + from IPython.core.display import display + from IPython.core.display import display_javascript + from IPython.core.display import update_display + _ipython_ready = True +except ImportError: + _ipython_ready = False + +try: + from timeloop import Timeloop + _tl_ready = True +except ImportError: + _tl_ready = False + +# 1-d types that need additional normalization to be compatible with DataFrame. +_one_dimension_types = (int, float, str, bool, list, tuple) + +_DIVE_SCRIPT_TEMPLATE = """ +document.querySelector("#{display_id}").data = {jsonstr};""" +_DIVE_HTML_TEMPLATE = """ +https://cdnjs.cloudflare.com/ajax/libs/webcomponentsjs/1.3.3/webcomponents-lite.js"> Review comment: The `facets` templates originate from examples in notebooks shared among the OSS community. The `jquery` template simply uses a jquery DataTable widget. For unittest, the logic of each template is too simple that there is nothing really to be tested. The logic I added to set display_id and update existing widgets have been unittested. We'll have some integration test later to execute notebook and verify output area. Such integration test framework is being developed by some other team. We'll merge our efforts there. For manual test, I've manually tested them from notebook. I can attach png and gif if needed. Off the topic: In future PRs, we might have `%%javascript` magic involved in the python code. Then we can have some library such as `js2py` for javascript unittests in Python. This is an automated message from the Apache Git Service. 
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 332775) Time Spent: 11h 40m (was: 11.5h) > Visualize PCollection with Interactive Beam > --- > > Key: BEAM-7926 > URL: https://issues.apache.org/jira/browse/BEAM-7926 > Project: Beam > Issue Type: New Feature > Components: runner-py-interactive >Reporter: Ning Kang >Assignee: Ning Kang >Priority: Major > Time
[jira] [Comment Edited] (BEAM-8397) DataflowRunnerTest.test_remote_runner_display_data fails due to infinite recursion during pickling.
[ https://issues.apache.org/jira/browse/BEAM-8397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958087#comment-16958087 ] Valentyn Tymofieiev edited comment on BEAM-8397 at 10/23/19 6:31 PM: - It appears that test_remote_runner_display_data triggers infinite recursion in _find_containing_class / _find_containing_class_inner methods. This recursion is present in Python 2 and all supported Python 3 versions, however Python is not reliably detecting this recursion, and so the recursion causes crashes and/or goes unnoticed depending on Python version and sys.getrecursionlimit(). To observe that there is a recursion problem, we can add some logging into _find_containing_class_inner. With changes in [1] we have: {noformat} python ./setup.py nosetests --nocapture --tests 'apache_beam/runners/dataflow/dataflow_runner_test.py:DataflowRunnerTest.test_remote_runner_display_data' == FAIL: test_remote_runner_display_data (apache_beam.runners.dataflow.dataflow_runner_test.DataflowRunnerTest) -- Traceback (most recent call last): File "/usr/local/google/home/valentyn/projects/beam/beam3/beam/sdks/python/apache_beam/runners/dataflow/dataflow_runner_test.py", line 283, in test_remote_runner_display_data self.assertEqual(len(disp_data), 4) AssertionError: 3 != 4 (This is intentional, otherwise the test passes) >> begin captured logging << ... root: WARNING: _find_containing_class_inner called 5312 times root: WARNING: _find_containing_class_inner called 5313 times root: WARNING: _find_containing_class_inner called 5314 times (This correlates with the increased number of recursion level set to in [1]) - >> end captured logging << ... 
{noformat} I reproduced this further in the following counterexample: {noformat} from apache_beam.internal import pickler class Outter(): def method(self): class InnerA(object): def __init__(self): pass class InnerB(InnerA): def __init__(self): super(InnerB, self).__init__() o = InnerB() pickler.loads(pickler.dumps(o)) c = Outter() c.method() {noformat} Running the above via python -m test fails (on Py2 and Py3.x) with: {noformat} RuntimeError: maximum recursion depth exceeded while getting the str of an object {noformat} I think the long-term course of action here should be removing patches[2] of dill (contributing the patches upstream, or switching to a different pickler: BEAM-8123). In short term, to unblock BEAM-5878 we can remove nested inner classes from test_remote_runner_display_data and proceed with planned dill upgrade to 0.3.1.1. As noted above, the upgrade itself does not trigger the test failure, as the infinite recursion is already happening but for some reason in most cases the test does not fail. It would be nice also to file an issue against CPython to improve detection of infinite recursion, with a repro they can follow. cc: [~robertwb] [~altay] [~udim] FYI. [1] [https://github.com/apache/beam/pull/9859] [2] https://github.com/apache/beam/blob/eb40876109a557586c33b77923025508712184ab/sdks/python/apache_beam/internal/pickler.py#L126 was (Author: tvalentyn): It appears that
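The comment's observation — that the recursion "causes crashes and/or goes unnoticed depending on Python version and sys.getrecursionlimit()" — comes down to CPython's configurable recursion guard. A minimal demonstration of that mechanism (unrelated to Beam's pickler; it only shows the guard itself):

```python
import sys

def recurse(n):
    # No base case: only the interpreter's recursion guard stops this.
    return recurse(n + 1)

old_limit = sys.getrecursionlimit()
sys.setrecursionlimit(500)  # low enough to trip the guard quickly
try:
    recurse(0)
    hit_limit = False
except RecursionError:
    hit_limit = True
finally:
    sys.setrecursionlimit(old_limit)  # always restore the global limit
```

On Python 2 the same guard raises `RuntimeError` rather than `RecursionError` (added in 3.5), which matches the "maximum recursion depth exceeded" traceback quoted in the comment.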
[jira] [Work logged] (BEAM-7926) Visualize PCollection with Interactive Beam
[ https://issues.apache.org/jira/browse/BEAM-7926?focusedWorklogId=332772&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332772 ] ASF GitHub Bot logged work on BEAM-7926: Author: ASF GitHub Bot Created on: 23/Oct/19 18:31 Start Date: 23/Oct/19 18:31 Worklog Time Spent: 10m Work Description: KevinGG commented on pull request #9741: [BEAM-7926] Visualize PCollection URL: https://github.com/apache/beam/pull/9741#discussion_r338210890

## File path: sdks/python/apache_beam/runners/interactive/display/pcoll_visualization.py

## @@ -0,0 +1,279 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Module visualizes PCollection data.
+
+For internal use only; no backwards-compatibility guarantees.
+Only works with Python 3.5+.
+"""
+from __future__ import absolute_import
+
+import base64
+import logging
+from datetime import timedelta
+
+from pandas.io.json import json_normalize
+
+from apache_beam import pvalue
+from apache_beam.runners.interactive import interactive_environment as ie
+from apache_beam.runners.interactive import pipeline_instrument as instr
+
+# jsons doesn't support < Python 3.5. Work around with json for legacy tests.
+# TODO(BEAM-8288): clean up once Py2 is deprecated from Beam.
+try:
+  import jsons
+  _pv_jsons_load = jsons.load
+  _pv_jsons_dump = jsons.dump
+except ImportError:
+  import json
+  _pv_jsons_load = json.load
+  _pv_jsons_dump = json.dump
+
+try:
+  from facets_overview.generic_feature_statistics_generator import GenericFeatureStatisticsGenerator
+  _facets_gfsg_ready = True
+except ImportError:
+  _facets_gfsg_ready = False
+
+try:
+  from IPython.core.display import HTML
+  from IPython.core.display import Javascript
+  from IPython.core.display import display
+  from IPython.core.display import display_javascript
+  from IPython.core.display import update_display
+  _ipython_ready = True
+except ImportError:
+  _ipython_ready = False
+
+try:
+  from timeloop import Timeloop
+  _tl_ready = True
+except ImportError:
+  _tl_ready = False
+
+# 1-d types that need additional normalization to be compatible with DataFrame.
+_one_dimension_types = (int, float, str, bool, list, tuple)
+
+_DIVE_SCRIPT_TEMPLATE = """
+document.querySelector("#{display_id}").data = {jsonstr};"""
+_DIVE_HTML_TEMPLATE = """
+<script src="https://cdnjs.cloudflare.com/ajax/libs/webcomponentsjs/1.3.3/webcomponents-lite.js"></script>

Review comment: The `facets` templates originate from examples in notebooks shared among the OSS community. The `jquery` template simply uses a jQuery DataTable widget. For unit tests, the logic of each template is simple enough that there is nothing really to test. We'll have some integration tests later that execute the notebook and verify the output area; such an integration test framework is being developed by another team, and we'll merge our efforts there. For manual testing, I've tested the templates from a notebook; I can attach png and gif files if needed. Off topic: in future PRs, we might have `%%javascript` magic involved in the Python code. Then we could use a library such as `js2py` for JavaScript unit tests in Python.

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 332772) Time Spent: 11h 20m (was: 11h 10m) > Visualize PCollection with Interactive Beam > --- > > Key: BEAM-7926 > URL: https://issues.apache.org/jira/browse/BEAM-7926 > Project: Beam > Issue Type: New Feature > Components: runner-py-interactive >Reporter: Ning Kang >Assignee: Ning Kang >Priority: Major > Time Spent: 11h 20m > Remaining Estimate: 0h > > Support auto plotting / charting of
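The pcoll_visualization diff above guards each optional dependency with a try/except ImportError and a module-level availability flag. A minimal, hedged sketch of that pattern (the names here are illustrative, not Beam's actual module layout):

```python
import json  # stdlib fallback, always available

# Prefer the optional third-party serializer when it is installed; record
# availability in a module-level flag so callers can degrade gracefully
# instead of raising at import time.
try:
    import jsons  # optional dependency; may be absent in legacy environments
    _HAVE_JSONS = True
except ImportError:
    _HAVE_JSONS = False

def to_json(obj):
    """Serialize obj, using the optional library when it is available."""
    if _HAVE_JSONS:
        return jsons.dumps(obj)
    return json.dumps(obj)
```

The flag-based approach lets the rest of the module stay import-safe on machines where the optional packages (facets, IPython, timeloop) are missing.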
[jira] [Work logged] (BEAM-8467) Enable reading compressed files with Python fileio
[ https://issues.apache.org/jira/browse/BEAM-8467?focusedWorklogId=332771&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332771 ] ASF GitHub Bot logged work on BEAM-8467: Author: ASF GitHub Bot Created on: 23/Oct/19 18:31 Start Date: 23/Oct/19 18:31 Worklog Time Spent: 10m Work Description: pabloem commented on pull request #9861: [BEAM-8467] Enabling reading compressed files URL: https://github.com/apache/beam/pull/9861 r: @chamikaramj

Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

- [ ] [**Choose reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and mention them in a comment (`R: @username`).
- [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
- [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).

See the [Contributor Guide](https://beam.apache.org/contribute) for more tips on [how to make review process smoother](https://beam.apache.org/contribute/#make-reviewers-job-easier).
Post-Commit Tests Status (on master branch)
[jira] [Created] (BEAM-8467) Enable reading compressed files with Python fileio
Pablo Estrada created BEAM-8467: --- Summary: Enable reading compressed files with Python fileio Key: BEAM-8467 URL: https://issues.apache.org/jira/browse/BEAM-8467 Project: Beam Issue Type: Improvement Components: io-py-files Reporter: Pablo Estrada Assignee: Pablo Estrada -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-8467) Enable reading compressed files with Python fileio
[ https://issues.apache.org/jira/browse/BEAM-8467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pablo Estrada updated BEAM-8467: Status: Open (was: Triage Needed) > Enable reading compressed files with Python fileio > -- > > Key: BEAM-8467 > URL: https://issues.apache.org/jira/browse/BEAM-8467 > Project: Beam > Issue Type: Improvement > Components: io-py-files >Reporter: Pablo Estrada >Assignee: Pablo Estrada >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-3713) Consider moving away from nose to nose2 or pytest.
[ https://issues.apache.org/jira/browse/BEAM-3713?focusedWorklogId=332769&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-332769 ] ASF GitHub Bot logged work on BEAM-3713: Author: ASF GitHub Bot Created on: 23/Oct/19 18:29 Start Date: 23/Oct/19 18:29 Worklog Time Spent: 10m Work Description: tvalentyn commented on pull request #9756: [BEAM-3713] Add pytest for unit tests URL: https://github.com/apache/beam/pull/9756#discussion_r338208964

## File path: sdks/python/apache_beam/runners/dataflow/dataflow_runner_test.py

## @@ -228,6 +229,8 @@ def test_biqquery_read_streaming_fail(self):
         r'source is not currently available'):
       p.run()
+  @pytest.mark.skipif(sys.version_info >= (3, 7),
+                      reason='TODO(BEAM-8095): Segfaults in Python 3.7')

Review comment: Regarding problems with this test, see: https://issues.apache.org/jira/browse/BEAM-8397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16958087#comment-16958087

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 332769) Time Spent: 10h 40m (was: 10.5h) > Consider moving away from nose to nose2 or pytest. > -- > > Key: BEAM-3713 > URL: https://issues.apache.org/jira/browse/BEAM-3713 > Project: Beam > Issue Type: Test > Components: sdk-py-core, testing >Reporter: Robert Bradshaw >Assignee: Udi Meiri >Priority: Minor > Time Spent: 10h 40m > Remaining Estimate: 0h > > Per https://nose.readthedocs.io/en/latest/, nose is in maintenance mode. -- This message was sent by Atlassian Jira (v8.3.4#803005)
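The hunk above gates the test on the Python version with `pytest.mark.skipif`. For environments without pytest, the same version gate can be sketched with the stdlib `unittest` (the test body below is a stand-in, not Beam's actual test):

```python
import sys
import unittest

class DataflowRunnerTest(unittest.TestCase):
    # Same condition the pytest marker evaluates at collection time.
    @unittest.skipIf(sys.version_info >= (3, 7),
                     'TODO(BEAM-8095): Segfaults in Python 3.7')
    def test_bigquery_read_streaming_fail(self):
        self.assertTrue(True)  # stand-in body

# Running the suite records the test as skipped on Python 3.7+,
# so the segfaulting code path is never entered there.
suite = unittest.TestLoader().loadTestsFromTestCase(DataflowRunnerTest)
result = unittest.TestResult()
suite.run(result)
```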
[jira] [Commented] (BEAM-8457) Instrument Dataflow jobs that are launched from Notebooks
[ https://issues.apache.org/jira/browse/BEAM-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16958147#comment-16958147 ] Pablo Estrada commented on BEAM-8457: - yes, this change is for the current release. > Instrument Dataflow jobs that are launched from Notebooks > - > > Key: BEAM-8457 > URL: https://issues.apache.org/jira/browse/BEAM-8457 > Project: Beam > Issue Type: Improvement > Components: runner-py-interactive >Reporter: Ning Kang >Assignee: Ning Kang >Priority: Major > Fix For: 2.17.0 > > Time Spent: 1h 20m > Remaining Estimate: 0h > > Dataflow needs the capability to tell how many Dataflow jobs are launched > from the Notebook environment, i.e., the Interactive Runner. > # Change the pipeline.run() API to allow supplying a runner and an options > parameter, so that a pipeline initially bundled w/ an interactive runner can > be directly run by other runners from a notebook. > # Implicitly add the necessary source information through user labels when > the user does p.run(runner=DataflowRunner()). -- This message was sent by Atlassian Jira (v8.3.4#803005)
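Item 1 of the issue above proposes letting pipeline.run() accept an overriding runner, and item 2 proposes tagging such jobs with a label. A toy sketch of that dispatch — the class names, `run_pipeline()` signature, and the label key are all hypothetical stand-ins, not Beam's actual API:

```python
class StubRunner:
    """Hypothetical runner that just echoes the labels it was given."""
    def run_pipeline(self, pipeline, labels):
        return labels

class Pipeline:
    def __init__(self, runner):
        self._runner = runner  # e.g. the interactive runner from the notebook

    def run(self, runner=None, labels=None):
        # An explicitly supplied runner overrides the bundled one.
        effective = runner if runner is not None else self._runner
        labels = dict(labels or {})
        if runner is not None:
            # Implicitly tag jobs launched from the notebook environment
            # (hypothetical label key, for illustration only).
            labels.setdefault('launched-from', 'interactive-notebook')
        return effective.run_pipeline(self, labels)

p = Pipeline(StubRunner())
tagged = p.run(runner=StubRunner())   # override path: label is attached
untagged = p.run()                    # bundled-runner path: no label
```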
[jira] [Resolved] (BEAM-8367) Python BigQuery sink should use unique IDs for mode STREAMING_INSERTS
[ https://issues.apache.org/jira/browse/BEAM-8367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pablo Estrada resolved BEAM-8367. - Resolution: Fixed > Python BigQuery sink should use unique IDs for mode STREAMING_INSERTS > - > > Key: BEAM-8367 > URL: https://issues.apache.org/jira/browse/BEAM-8367 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Reporter: Chamikara Madhusanka Jayalath >Assignee: Pablo Estrada >Priority: Blocker > Fix For: 2.17.0 > > Time Spent: 2h > Remaining Estimate: 0h > > Unique IDs ensure (best effort) that writes to BigQuery are idempotent, for > example, we don't write the same record twice in a VM failure. > > Currently the Python BQ sink inserts BQ IDs here, but they'll be re-generated after a > VM failure, resulting in data duplication. > [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py#L766] > > The correct fix is to do a Reshuffle to checkpoint unique IDs once they are > generated, similar to how the Java BQ sink operates. > [https://github.com/apache/beam/blob/dcf6ad301069e4d2cfaec5db6b178acb7bb67f49/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/StreamingWriteTables.java#L225] > > Pablo, can you do an initial assessment here? > I think this is a relatively small fix, but I might be wrong. -- This message was sent by Atlassian Jira (v8.3.4#803005)
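The fix described in the issue above — generate unique insert IDs once, checkpoint them (via a Reshuffle), and replay the same IDs on retry — is what makes the streaming write idempotent. A toy, Beam-free simulation of why stable IDs deduplicate retried inserts (`FakeBigQuery` is a stand-in for BigQuery's best-effort insert-ID dedup, not the real API):

```python
import uuid

def tag_with_insert_ids(rows):
    """Stage 1: assign each row a unique insert ID exactly once.

    In the real fix this output is checkpointed (e.g. via a Reshuffle) so a
    retry replays the SAME IDs instead of regenerating them.
    """
    return [(str(uuid.uuid4()), row) for row in rows]

class FakeBigQuery:
    """Toy sink that dedupes on insert ID, like BQ streaming inserts (best effort)."""
    def __init__(self):
        self.rows = {}

    def insert(self, tagged_rows):
        for insert_id, row in tagged_rows:
            self.rows[insert_id] = row  # same ID -> overwrite, not duplicate

bq = FakeBigQuery()
tagged = tag_with_insert_ids(['a', 'b'])  # IDs generated once, then checkpointed
bq.insert(tagged)
bq.insert(tagged)  # simulate a retry after a VM failure: same IDs replayed
```

If the retry regenerated IDs instead (calling `tag_with_insert_ids` again), the second insert would add two more rows, duplicating the data.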
[jira] [Commented] (BEAM-8367) Python BigQuery sink should use unique IDs for mode STREAMING_INSERTS
[ https://issues.apache.org/jira/browse/BEAM-8367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16958146#comment-16958146 ] Pablo Estrada commented on BEAM-8367: - Yes. Fixed. Thanks! > Python BigQuery sink should use unique IDs for mode STREAMING_INSERTS > - > > Key: BEAM-8367 > URL: https://issues.apache.org/jira/browse/BEAM-8367 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Reporter: Chamikara Madhusanka Jayalath >Assignee: Pablo Estrada >Priority: Blocker > Fix For: 2.17.0 > > Time Spent: 2h > Remaining Estimate: 0h > > Unique IDs ensure (best effort) that writes to BigQuery are idempotent, for > example, we don't write the same record twice in a VM failure. > > Currently the Python BQ sink inserts BQ IDs here, but they'll be re-generated after a > VM failure, resulting in data duplication. > [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py#L766] > > The correct fix is to do a Reshuffle to checkpoint unique IDs once they are > generated, similar to how the Java BQ sink operates. > [https://github.com/apache/beam/blob/dcf6ad301069e4d2cfaec5db6b178acb7bb67f49/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/StreamingWriteTables.java#L225] > > Pablo, can you do an initial assessment here? > I think this is a relatively small fix, but I might be wrong. -- This message was sent by Atlassian Jira (v8.3.4#803005)