[jira] [Work logged] (BEAM-9258) [Python] PTransform that connects to Cloud DLP deidentification service
[ https://issues.apache.org/jira/browse/BEAM-9258?focusedWorklogId=394342&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-394342 ]

ASF GitHub Bot logged work on BEAM-9258:

Author: ASF GitHub Bot
Created on: 27/Feb/20 17:29
Start Date: 27/Feb/20 17:29
Worklog Time Spent: 10m
Work Description: aaltay commented on issue #10961: [BEAM-9258] Add integration test for Cloud DLP
URL: https://github.com/apache/beam/pull/10961#issuecomment-592080779

Actually I can merge. All tests are passing and it is reviewed.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Issue Time Tracking
---
Worklog Id: (was: 394342)
Time Spent: 4h 10m (was: 4h)

> [Python] PTransform that connects to Cloud DLP deidentification service
> ---
>
> Key: BEAM-9258
> URL: https://issues.apache.org/jira/browse/BEAM-9258
> Project: Beam
> Issue Type: Sub-task
> Components: io-py-gcp
> Reporter: Michał Walenia
> Assignee: Michał Walenia
> Priority: Major
> Fix For: 2.20.0
>
> Time Spent: 4h 10m
> Remaining Estimate: 0h
>

--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-9258) [Python] PTransform that connects to Cloud DLP deidentification service
[ https://issues.apache.org/jira/browse/BEAM-9258?focusedWorklogId=394341&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-394341 ]

ASF GitHub Bot logged work on BEAM-9258:

Author: ASF GitHub Bot
Created on: 27/Feb/20 17:28
Start Date: 27/Feb/20 17:28
Worklog Time Spent: 10m
Work Description: aaltay commented on pull request #10961: [BEAM-9258] Add integration test for Cloud DLP
URL: https://github.com/apache/beam/pull/10961

Issue Time Tracking
---
Worklog Id: (was: 394341)
Time Spent: 4h (was: 3h 50m)
[jira] [Work logged] (BEAM-9258) [Python] PTransform that connects to Cloud DLP deidentification service
[ https://issues.apache.org/jira/browse/BEAM-9258?focusedWorklogId=394007&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-394007 ]

ASF GitHub Bot logged work on BEAM-9258:

Author: ASF GitHub Bot
Created on: 27/Feb/20 07:49
Start Date: 27/Feb/20 07:49
Worklog Time Spent: 10m
Work Description: mwalenia commented on issue #10961: [BEAM-9258] Add integration test for Cloud DLP
URL: https://github.com/apache/beam/pull/10961#issuecomment-591830855

R: @aaltay
cc: @kamilwu

Issue Time Tracking
---
Worklog Id: (was: 394007)
Time Spent: 3h 50m (was: 3h 40m)
[jira] [Work logged] (BEAM-9258) [Python] PTransform that connects to Cloud DLP deidentification service
[ https://issues.apache.org/jira/browse/BEAM-9258?focusedWorklogId=393223&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-393223 ]

ASF GitHub Bot logged work on BEAM-9258:

Author: ASF GitHub Bot
Created on: 26/Feb/20 07:57
Start Date: 26/Feb/20 07:57
Worklog Time Spent: 10m
Work Description: mwalenia commented on issue #10961: [BEAM-9258] Add integration test for Cloud DLP
URL: https://github.com/apache/beam/pull/10961#issuecomment-591289650

Run Python 2 PostCommit

Issue Time Tracking
---
Worklog Id: (was: 393223)
Time Spent: 3h 40m (was: 3.5h)
[jira] [Work logged] (BEAM-9258) [Python] PTransform that connects to Cloud DLP deidentification service
[ https://issues.apache.org/jira/browse/BEAM-9258?focusedWorklogId=392529&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-392529 ]

ASF GitHub Bot logged work on BEAM-9258:

Author: ASF GitHub Bot
Created on: 25/Feb/20 12:46
Start Date: 25/Feb/20 12:46
Worklog Time Spent: 10m
Work Description: mwalenia commented on issue #10961: [BEAM-9258] Add integration test for Cloud DLP
URL: https://github.com/apache/beam/pull/10961#issuecomment-590849088

Run Python 2 PostCommit

Issue Time Tracking
---
Worklog Id: (was: 392529)
Time Spent: 3.5h (was: 3h 20m)
[jira] [Work logged] (BEAM-9258) [Python] PTransform that connects to Cloud DLP deidentification service
[ https://issues.apache.org/jira/browse/BEAM-9258?focusedWorklogId=392528&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-392528 ]

ASF GitHub Bot logged work on BEAM-9258:

Author: ASF GitHub Bot
Created on: 25/Feb/20 12:45
Start Date: 25/Feb/20 12:45
Worklog Time Spent: 10m
Work Description: mwalenia commented on pull request #10961: [BEAM-9258] Add integration test for Cloud DLP
URL: https://github.com/apache/beam/pull/10961
[jira] [Work logged] (BEAM-9258) [Python] PTransform that connects to Cloud DLP deidentification service
[ https://issues.apache.org/jira/browse/BEAM-9258?focusedWorklogId=388891&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-388891 ]

ASF GitHub Bot logged work on BEAM-9258:

Author: ASF GitHub Bot
Created on: 18/Feb/20 16:38
Start Date: 18/Feb/20 16:38
Worklog Time Spent: 10m
Work Description: aaltay commented on pull request #10849: [BEAM-9258] Integrate Google Cloud Data loss prevention functionality for Python SDK
URL: https://github.com/apache/beam/pull/10849

Issue Time Tracking
---
Worklog Id: (was: 388891)
Time Spent: 3h 10m (was: 3h)
[jira] [Work logged] (BEAM-9258) [Python] PTransform that connects to Cloud DLP deidentification service
[ https://issues.apache.org/jira/browse/BEAM-9258?focusedWorklogId=388389&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-388389 ]

ASF GitHub Bot logged work on BEAM-9258:

Author: ASF GitHub Bot
Created on: 17/Feb/20 12:57
Start Date: 17/Feb/20 12:57
Worklog Time Spent: 10m
Work Description: mwalenia commented on issue #10849: [BEAM-9258] Integrate Google Cloud Data loss prevention functionality for Python SDK
URL: https://github.com/apache/beam/pull/10849#issuecomment-586981182

@aaltay Thanks for your suggestions, I added the `project` param back to the classes. WDYT now?

Issue Time Tracking
---
Worklog Id: (was: 388389)
Time Spent: 3h (was: 2h 50m)
[jira] [Work logged] (BEAM-9258) [Python] PTransform that connects to Cloud DLP deidentification service
[ https://issues.apache.org/jira/browse/BEAM-9258?focusedWorklogId=388246&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-388246 ]

ASF GitHub Bot logged work on BEAM-9258:

Author: ASF GitHub Bot
Created on: 17/Feb/20 08:32
Start Date: 17/Feb/20 08:32
Worklog Time Spent: 10m
Work Description: mwalenia commented on issue #10849: [BEAM-9258] Integrate Google Cloud Data loss prevention functionality for Python SDK
URL: https://github.com/apache/beam/pull/10849#issuecomment-586874143

Run Python PreCommit

Issue Time Tracking
---
Worklog Id: (was: 388246)
Time Spent: 2h 50m (was: 2h 40m)
[jira] [Work logged] (BEAM-9258) [Python] PTransform that connects to Cloud DLP deidentification service
[ https://issues.apache.org/jira/browse/BEAM-9258?focusedWorklogId=387498&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-387498 ]

ASF GitHub Bot logged work on BEAM-9258:

Author: ASF GitHub Bot
Created on: 14/Feb/20 17:31
Start Date: 14/Feb/20 17:31
Worklog Time Spent: 10m
Work Description: aaltay commented on pull request #10849: [BEAM-9258] Integrate Google Cloud Data loss prevention functionality for Python SDK
URL: https://github.com/apache/beam/pull/10849#discussion_r379554152

## File path: sdks/python/apache_beam/ml/gcp/cloud_dlp.py

@@ -0,0 +1,214 @@
+# /*
+# * Licensed to the Apache Software Foundation (ASF) under one
+# * or more contributor license agreements. See the NOTICE file
+# * distributed with this work for additional information
+# * regarding copyright ownership. The ASF licenses this file
+# * to you under the Apache License, Version 2.0 (the
+# * "License"); you may not use this file except in compliance
+# * with the License. You may obtain a copy of the License at
+# *
+# * http://www.apache.org/licenses/LICENSE-2.0
+# *
+# * Unless required by applicable law or agreed to in writing, software
+# * distributed under the License is distributed on an "AS IS" BASIS,
+# * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# * See the License for the specific language governing permissions and
+# * limitations under the License.
+# */
+
+"""``PTransforms`` that implement Google Cloud Data Loss Prevention
+functionality.
+"""
+
+from __future__ import absolute_import
+
+import logging
+
+from google.cloud import dlp_v2
+
+from apache_beam.options.pipeline_options import GoogleCloudOptions
+from apache_beam.transforms import DoFn
+from apache_beam.transforms import ParDo
+from apache_beam.transforms import PTransform
+from apache_beam.utils.annotations import experimental
+
+__all__ = ['MaskDetectedDetails', 'InspectForDetails']
+
+_LOGGER = logging.getLogger(__name__)
+
+
+@experimental()
+class MaskDetectedDetails(PTransform):
+  """Scrubs sensitive information detected in text.
+  The ``PTransform`` returns a ``PCollection`` of ``str``
+  Example usage::
+    pipeline | MaskDetectedDetails(project='example-gcp-project',

Review comment:
You may need to update this example for the project argument.

Issue Time Tracking
---
Worklog Id: (was: 387498)
Time Spent: 2h 40m (was: 2.5h)
[jira] [Work logged] (BEAM-9258) [Python] PTransform that connects to Cloud DLP deidentification service
[ https://issues.apache.org/jira/browse/BEAM-9258?focusedWorklogId=387496&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-387496 ]

ASF GitHub Bot logged work on BEAM-9258:

Author: ASF GitHub Bot
Created on: 14/Feb/20 17:31
Start Date: 14/Feb/20 17:31
Worklog Time Spent: 10m
Work Description: aaltay commented on issue #10849: [BEAM-9258] Integrate Google Cloud Data loss prevention functionality for Python SDK
URL: https://github.com/apache/beam/pull/10849#issuecomment-586388747

LGTM.

Related to `project`: do you think we should leave it and default it to the project from the flags? The reason is that people sometimes want to use different projects for other resources (e.g. run a Dataflow job on project X, make API calls to project Y). Giving it a default but keeping it configurable would help users.

Issue Time Tracking
---
Worklog Id: (was: 387496)
Time Spent: 2.5h (was: 2h 20m)
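
The suggestion above (use the pipeline's project by default, but let callers override it) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code from the pull request: `resolve_project` and its parameter names are hypothetical, and `pipeline_project` stands in for whatever value the `--project` flag supplies.

```python
from typing import Optional


def resolve_project(explicit_project: Optional[str],
                    pipeline_project: Optional[str]) -> str:
    """Pick the GCP project to use for API calls.

    Hypothetical helper: an explicitly passed project wins; otherwise the
    project taken from the pipeline flags is used as the default.
    """
    if explicit_project is not None:
        return explicit_project
    if pipeline_project is not None:
        return pipeline_project
    raise ValueError(
        'GCP project name needs to be specified in the "project" argument '
        'or through the pipeline options')


if __name__ == '__main__':
    # Job runs on project X, API calls go to project Y.
    print(resolve_project('project-y', 'project-x'))
    # No override given: fall back to the pipeline's project.
    print(resolve_project(None, 'project-x'))
```

With this shape, a user who never passes `project` gets the pipeline's project, while a user who needs a separate project for API calls can still name it.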
[jira] [Work logged] (BEAM-9258) [Python] PTransform that connects to Cloud DLP deidentification service
[ https://issues.apache.org/jira/browse/BEAM-9258?focusedWorklogId=387313&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-387313 ]

ASF GitHub Bot logged work on BEAM-9258:

Author: ASF GitHub Bot
Created on: 14/Feb/20 12:16
Start Date: 14/Feb/20 12:16
Worklog Time Spent: 10m
Work Description: mwalenia commented on issue #10849: [BEAM-9258] Integrate Google Cloud Data loss prevention functionality for Python SDK
URL: https://github.com/apache/beam/pull/10849#issuecomment-586264251

Run Python PreCommit

Issue Time Tracking
---
Worklog Id: (was: 387313)
Time Spent: 2h 20m (was: 2h 10m)
[jira] [Work logged] (BEAM-9258) [Python] PTransform that connects to Cloud DLP deidentification service
[ https://issues.apache.org/jira/browse/BEAM-9258?focusedWorklogId=387258&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-387258 ]

ASF GitHub Bot logged work on BEAM-9258:

Author: ASF GitHub Bot
Created on: 14/Feb/20 10:07
Start Date: 14/Feb/20 10:07
Worklog Time Spent: 10m
Work Description: mwalenia commented on issue #10849: [BEAM-9258] Integrate Google Cloud Data loss prevention functionality for Python SDK
URL: https://github.com/apache/beam/pull/10849#issuecomment-586190524

Run Python2_PVR_Flink PreCommit

Issue Time Tracking
---
Worklog Id: (was: 387258)
Time Spent: 2h 10m (was: 2h)
[jira] [Work logged] (BEAM-9258) [Python] PTransform that connects to Cloud DLP deidentification service
[ https://issues.apache.org/jira/browse/BEAM-9258?focusedWorklogId=387254&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-387254 ]

ASF GitHub Bot logged work on BEAM-9258:

Author: ASF GitHub Bot
Created on: 14/Feb/20 09:47
Start Date: 14/Feb/20 09:47
Worklog Time Spent: 10m
Work Description: mwalenia commented on issue #10849: [BEAM-9258] Integrate Google Cloud Data loss prevention functionality for Python SDK
URL: https://github.com/apache/beam/pull/10849#issuecomment-586182523

@aaltay Thanks for the review. I fixed the code so that the `--project` pipeline option is used for the project name. I also changed the way arguments are handled to reflect their meaning in the underlying DLP service: a template name and a config object are allowed simultaneously, but the config overrides the template.

Issue Time Tracking
---
Worklog Id: (was: 387254)
Time Spent: 2h (was: 1h 50m)
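
The argument handling described above can be sketched roughly as follows. This is a hedged illustration, not the code from the pull request: `build_deidentify_kwargs` is a hypothetical helper, the two dictionary keys are the ones visible in the quoted diff for this file, and keeping the "at least one is required" check is an assumption. Note that the helper only forwards both values; per the comment, the precedence of the config over the template is applied by the DLP service itself.

```python
from typing import Any, Dict, Optional


def build_deidentify_kwargs(
        deidentification_template_name: Optional[str] = None,
        deidentification_config: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Collect the de-identification arguments to forward to the service.

    Hypothetical helper: both arguments may be given at the same time and
    both are forwarded unchanged.
    """
    config: Dict[str, Any] = {}
    if deidentification_template_name is not None:
        config['deidentify_template_name'] = deidentification_template_name
    if deidentification_config is not None:
        config['deidentify_config'] = deidentification_config
    if not config:
        # Assumption: at least one of the two is still required.
        raise ValueError(
            'deidentification_template_name or '
            'deidentification_config must be specified.')
    return config


if __name__ == '__main__':
    # Both supplied: no error is raised, and both keys are forwarded.
    print(build_deidentify_kwargs(
        deidentification_template_name='example-template',
        deidentification_config={'info_type_transformations': {}}))
```

Compared with a mutually exclusive check, this keeps the transform a thin wrapper and leaves the override semantics in one place, the service.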
[jira] [Work logged] (BEAM-9258) [Python] PTransform that connects to Cloud DLP deidentification service
[ https://issues.apache.org/jira/browse/BEAM-9258?focusedWorklogId=387233&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-387233 ]

ASF GitHub Bot logged work on BEAM-9258:

Author: ASF GitHub Bot
Created on: 14/Feb/20 09:18
Start Date: 14/Feb/20 09:18
Worklog Time Spent: 10m
Work Description: mwalenia commented on pull request #10849: [BEAM-9258] Integrate Google Cloud Data loss prevention functionality for Python SDK
URL: https://github.com/apache/beam/pull/10849#discussion_r379323072

## File path: sdks/python/apache_beam/ml/gcp/cloud_dlp.py

@@ -0,0 +1,224 @@
+# /*
+# * Licensed to the Apache Software Foundation (ASF) under one
+# * or more contributor license agreements. See the NOTICE file
+# * distributed with this work for additional information
+# * regarding copyright ownership. The ASF licenses this file
+# * to you under the Apache License, Version 2.0 (the
+# * "License"); you may not use this file except in compliance
+# * with the License. You may obtain a copy of the License at
+# *
+# * http://www.apache.org/licenses/LICENSE-2.0
+# *
+# * Unless required by applicable law or agreed to in writing, software
+# * distributed under the License is distributed on an "AS IS" BASIS,
+# * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# * See the License for the specific language governing permissions and
+# * limitations under the License.
+# */
+
+"""``PTransforms`` that implement Google Cloud Data Loss Prevention
+functionality.
+"""
+
+from __future__ import absolute_import
+
+import logging
+
+from google.cloud import dlp_v2
+
+import apache_beam as beam
+from apache_beam.utils import retry
+from apache_beam.utils.annotations import experimental
+
+__all__ = ['MaskDetectedDetails', 'InspectForDetails']
+
+_LOGGER = logging.getLogger(__name__)
+
+
+@experimental()
+class MaskDetectedDetails(beam.PTransform):
+  """Scrubs sensitive information detected in text.
+  The ``PTransform`` returns a ``PCollection`` of ``str``
+  Example usage::
+    pipeline | MaskDetectedDetails(project='example-gcp-project',
+      deidentification_config={
+          'info_type_transformations: {
+              'transformations': [{
+                  'primitive_transformation': {
+                      'character_mask_config': {
+                          'masking_character': '#'
+                      }
+                  }
+              }]
+          }
+      }, inspection_config={'info_types': [{'name': 'EMAIL_ADDRESS'}]})
+  """
+  def __init__(
+      self,
+      project=None,
+      deidentification_template_name=None,
+      deidentification_config=None,
+      inspection_template_name=None,
+      inspection_config=None,
+      timeout=None):
+    """Initializes a :class:`MaskDetectedDetails` transform.
+    Args:
+      project (str): Required. GCP project in which the data processing is
+        to be done
+      deidentification_template_name (str): Either this or
+        `deidentification_config` required. Name of
+        deidentification template to be used on detected sensitive information
+        instances in text.
+      deidentification_config
+        (``Union[dict, google.cloud.dlp_v2.types.DeidentifyConfig]``):
+        Configuration for the de-identification of the content item.
+      inspection_template_name (str): This or `inspection_config` required.
+        Name of inspection template to be used
+        to detect sensitive data in text.
+      inspection_config
+        (``Union[dict, google.cloud.dlp_v2.types.InspectConfig]``):
+        Configuration for the inspector used to detect sensitive data in text.
+      timeout (float): Optional. The amount of time, in seconds, to wait for
+        the request to complete.
+    """
+    self.config = {}
+    self.project = project
+    self.timeout = timeout
+    if project is None:
+      raise ValueError(
+          'GCP project name needs to be specified in "project" property')
+    if deidentification_template_name is not None \
+        and deidentification_config is not None:
+      raise ValueError(
+          'Both deidentification_template_name and '
+          'deidentification_config were specified.'
+          ' Please specify only one of these.')
+    elif deidentification_template_name is None \
+        and deidentification_config is None:
+      raise ValueError(
+          'deidentification_template_name or '
+          'deidentification_config must be specified.')
+    elif deidentification_template_name is not None:
+      self.config['deidentify_template_name'] = deidentification_template_name
+    else:
+      self.config['deidentify_config'] = deidentification_config
+
+    if inspection_template_name is not None and inspection_config is not None:
+      raise ValueError(
+          'Both ins
[jira] [Work logged] (BEAM-9258) [Python] PTransform that connects to Cloud DLP deidentification service
[ https://issues.apache.org/jira/browse/BEAM-9258?focusedWorklogId=387225&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-387225 ]

ASF GitHub Bot logged work on BEAM-9258:

Author: ASF GitHub Bot
Created on: 14/Feb/20 09:09
Start Date: 14/Feb/20 09:09
Worklog Time Spent: 10m
Work Description: mwalenia commented on pull request #10849: [BEAM-9258] Integrate Google Cloud Data loss prevention functionality for Python SDK
URL: https://github.com/apache/beam/pull/10849#discussion_r379318925

## File path: sdks/python/apache_beam/ml/gcp/cloud_dlp.py

@@ -0,0 +1,224 @@
+# /*
+# * Licensed to the Apache Software Foundation (ASF) under one
+# * or more contributor license agreements. See the NOTICE file
+# * distributed with this work for additional information
+# * regarding copyright ownership. The ASF licenses this file
+# * to you under the Apache License, Version 2.0 (the
+# * "License"); you may not use this file except in compliance
+# * with the License. You may obtain a copy of the License at
+# *
+# * http://www.apache.org/licenses/LICENSE-2.0
+# *
+# * Unless required by applicable law or agreed to in writing, software
+# * distributed under the License is distributed on an "AS IS" BASIS,
+# * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# * See the License for the specific language governing permissions and
+# * limitations under the License.
+# */
+
+"""``PTransforms`` that implement Google Cloud Data Loss Prevention
+functionality.
+"""
+
+from __future__ import absolute_import
+
+import logging
+
+from google.cloud import dlp_v2
+
+import apache_beam as beam

Review comment:
Ok!

Issue Time Tracking
---
Worklog Id: (was: 387225)
Time Spent: 1h 40m (was: 1.5h)
[jira] [Work logged] (BEAM-9258) [Python] PTransform that connects to Cloud DLP deidentification service
[ https://issues.apache.org/jira/browse/BEAM-9258?focusedWorklogId=387223&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-387223 ]

ASF GitHub Bot logged work on BEAM-9258:

Author: ASF GitHub Bot
Created on: 14/Feb/20 09:08
Start Date: 14/Feb/20 09:08
Worklog Time Spent: 10m
Work Description: mwalenia commented on pull request #10849: [BEAM-9258] Integrate Google Cloud Data loss prevention functionality for Python SDK
URL: https://github.com/apache/beam/pull/10849#discussion_r379318446

## File path: sdks/python/apache_beam/ml/gcp/cloud_dlp.py

@@ -0,0 +1,224 @@
[jira] [Work logged] (BEAM-9258) [Python] PTransform that connects to Cloud DLP deidentification service
[ https://issues.apache.org/jira/browse/BEAM-9258?focusedWorklogId=387222&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-387222 ] ASF GitHub Bot logged work on BEAM-9258: Author: ASF GitHub Bot Created on: 14/Feb/20 09:07 Start Date: 14/Feb/20 09:07 Worklog Time Spent: 10m Work Description: mwalenia commented on pull request #10849: [BEAM-9258] Integrate Google Cloud Data loss prevention functionality for Python SDK URL: https://github.com/apache/beam/pull/10849#discussion_r379318361 ## File path: sdks/python/apache_beam/ml/gcp/cloud_dlp.py ## @@ -0,0 +1,224 @@ +# /* +# * Licensed to the Apache Software Foundation (ASF) under one +# * or more contributor license agreements. See the NOTICE file +# * distributed with this work for additional information +# * regarding copyright ownership. The ASF licenses this file +# * to you under the Apache License, Version 2.0 (the +# * "License"); you may not use this file except in compliance +# * with the License. You may obtain a copy of the License at +# * +# * http://www.apache.org/licenses/LICENSE-2.0 +# * +# * Unless required by applicable law or agreed to in writing, software +# * distributed under the License is distributed on an "AS IS" BASIS, +# * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# * See the License for the specific language governing permissions and +# * limitations under the License. +# */ + +"""``PTransforms`` that implement Google Cloud Data Loss Prevention +functionality. +""" + +from __future__ import absolute_import + +import logging + +from google.cloud import dlp_v2 + +import apache_beam as beam +from apache_beam.utils import retry +from apache_beam.utils.annotations import experimental + +__all__ = ['MaskDetectedDetails', 'InspectForDetails'] + +_LOGGER = logging.getLogger(__name__) + + +@experimental() +class MaskDetectedDetails(beam.PTransform): + """Scrubs sensitive information detected in text. 
+ The ``PTransform`` returns a ``PCollection`` of ``str`` + Example usage:: +pipeline | MaskDetectedDetails(project='example-gcp-project', + deidentification_config={ + 'info_type_transformations: { + 'transformations': [{ + 'primitive_transformation': { + 'character_mask_config': { + 'masking_character': '#' + } + } + }] + } + }, inspection_config={'info_types': [{'name': 'EMAIL_ADDRESS'}]}) + """ + def __init__( + self, + project=None, + deidentification_template_name=None, + deidentification_config=None, + inspection_template_name=None, + inspection_config=None, + timeout=None): +"""Initializes a :class:`MaskDetectedDetails` transform. +Args: + project (str): Required. GCP project in which the data processing is +to be done + deidentification_template_name (str): Either this or +`deidentification_config` required. Name of +deidentification template to be used on detected sensitive information +instances in text. + deidentification_config +(``Union[dict, google.cloud.dlp_v2.types.DeidentifyConfig]``): +Configuration for the de-identification of the content item. + inspection_template_name (str): This or `inspection_config` required. +Name of inspection template to be used +to detect sensitive data in text. + inspection_config +(``Union[dict, google.cloud.dlp_v2.types.InspectConfig]``): +Configuration for the inspector used to detect sensitive data in text. + timeout (float): Optional. The amount of time, in seconds, to wait for +the request to complete. +""" +self.config = {} +self.project = project +self.timeout = timeout +if project is None: + raise ValueError( + 'GCP project name needs to be specified in "project" property') +if deidentification_template_name is not None \ +and deidentification_config is not None: + raise ValueError( + 'Both deidentification_template_name and ' + 'deidentification_config were specified.' 
+ ' Please specify only one of these.') +elif deidentification_template_name is None \ +and deidentification_config is None: + raise ValueError( + 'deidentification_template_name or ' + 'deidentification_config must be specified.') +elif deidentification_template_name is not None: + self.config['deidentify_template_name'] = deidentification_template_name +else: + self.config['deidentify_config'] = deidentification_config + +if inspection_template_name is not None and inspection_config is not None: + raise ValueError( + 'Both ins
[jira] [Work logged] (BEAM-9258) [Python] PTransform that connects to Cloud DLP deidentification service
[ https://issues.apache.org/jira/browse/BEAM-9258?focusedWorklogId=386755&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-386755 ] ASF GitHub Bot logged work on BEAM-9258: Author: ASF GitHub Bot Created on: 13/Feb/20 17:10 Start Date: 13/Feb/20 17:10 Worklog Time Spent: 10m Work Description: aaltay commented on pull request #10849: [BEAM-9258] Integrate Google Cloud Data loss prevention functionality for Python SDK URL: https://github.com/apache/beam/pull/10849#discussion_r378987762 ## File path: sdks/python/apache_beam/ml/gcp/cloud_dlp.py ## @@ -0,0 +1,224 @@ +# /* +# * Licensed to the Apache Software Foundation (ASF) under one +# * or more contributor license agreements. See the NOTICE file +# * distributed with this work for additional information +# * regarding copyright ownership. The ASF licenses this file +# * to you under the Apache License, Version 2.0 (the +# * "License"); you may not use this file except in compliance +# * with the License. You may obtain a copy of the License at +# * +# * http://www.apache.org/licenses/LICENSE-2.0 +# * +# * Unless required by applicable law or agreed to in writing, software +# * distributed under the License is distributed on an "AS IS" BASIS, +# * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# * See the License for the specific language governing permissions and +# * limitations under the License. +# */ + +"""``PTransforms`` that implement Google Cloud Data Loss Prevention +functionality. +""" + +from __future__ import absolute_import + +import logging + +from google.cloud import dlp_v2 + +import apache_beam as beam +from apache_beam.utils import retry +from apache_beam.utils.annotations import experimental + +__all__ = ['MaskDetectedDetails', 'InspectForDetails'] + +_LOGGER = logging.getLogger(__name__) + + +@experimental() +class MaskDetectedDetails(beam.PTransform): + """Scrubs sensitive information detected in text. 
+ The ``PTransform`` returns a ``PCollection`` of ``str`` + Example usage:: +pipeline | MaskDetectedDetails(project='example-gcp-project', + deidentification_config={ + 'info_type_transformations: { + 'transformations': [{ + 'primitive_transformation': { + 'character_mask_config': { + 'masking_character': '#' + } + } + }] + } + }, inspection_config={'info_types': [{'name': 'EMAIL_ADDRESS'}]}) + """ + def __init__( + self, + project=None, + deidentification_template_name=None, + deidentification_config=None, + inspection_template_name=None, + inspection_config=None, + timeout=None): +"""Initializes a :class:`MaskDetectedDetails` transform. +Args: + project (str): Required. GCP project in which the data processing is Review comment: Would it make sense to default to the project from gcp options? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 386755) Time Spent: 0.5h (was: 20m) > [Python] PTransform that connects to Cloud DLP deidentification service > --- > > Key: BEAM-9258 > URL: https://issues.apache.org/jira/browse/BEAM-9258 > Project: Beam > Issue Type: Sub-task > Components: io-py-gcp >Reporter: Michał Walenia >Assignee: Michał Walenia >Priority: Major > Time Spent: 0.5h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-9258) [Python] PTransform that connects to Cloud DLP deidentification service
[ https://issues.apache.org/jira/browse/BEAM-9258?focusedWorklogId=386756&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-386756 ] ASF GitHub Bot logged work on BEAM-9258: Author: ASF GitHub Bot Created on: 13/Feb/20 17:10 Start Date: 13/Feb/20 17:10 Worklog Time Spent: 10m Work Description: aaltay commented on pull request #10849: [BEAM-9258] Integrate Google Cloud Data loss prevention functionality for Python SDK URL: https://github.com/apache/beam/pull/10849#discussion_r378989261 ## File path: sdks/python/apache_beam/ml/gcp/cloud_dlp.py ## @@ -0,0 +1,224 @@ +# /* +# * Licensed to the Apache Software Foundation (ASF) under one +# * or more contributor license agreements. See the NOTICE file +# * distributed with this work for additional information +# * regarding copyright ownership. The ASF licenses this file +# * to you under the Apache License, Version 2.0 (the +# * "License"); you may not use this file except in compliance +# * with the License. You may obtain a copy of the License at +# * +# * http://www.apache.org/licenses/LICENSE-2.0 +# * +# * Unless required by applicable law or agreed to in writing, software +# * distributed under the License is distributed on an "AS IS" BASIS, +# * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# * See the License for the specific language governing permissions and +# * limitations under the License. +# */ + +"""``PTransforms`` that implement Google Cloud Data Loss Prevention +functionality. +""" + +from __future__ import absolute_import + +import logging + +from google.cloud import dlp_v2 + +import apache_beam as beam Review comment: Let's do more explicit imports here e.g. from apache_beam.transforms import ParDo from apache_beam.transforms import PTransform then use them directly like `PTransform` instead of the `beam.PTransform` style. This is an automated message from the Apache Git Service. 
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 386756) Time Spent: 40m (was: 0.5h) > [Python] PTransform that connects to Cloud DLP deidentification service > --- > > Key: BEAM-9258 > URL: https://issues.apache.org/jira/browse/BEAM-9258 > Project: Beam > Issue Type: Sub-task > Components: io-py-gcp >Reporter: Michał Walenia >Assignee: Michał Walenia >Priority: Major > Time Spent: 40m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-9258) [Python] PTransform that connects to Cloud DLP deidentification service
[ https://issues.apache.org/jira/browse/BEAM-9258?focusedWorklogId=386758&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-386758 ] ASF GitHub Bot logged work on BEAM-9258: Author: ASF GitHub Bot Created on: 13/Feb/20 17:10 Start Date: 13/Feb/20 17:10 Worklog Time Spent: 10m Work Description: aaltay commented on pull request #10849: [BEAM-9258] Integrate Google Cloud Data loss prevention functionality for Python SDK URL: https://github.com/apache/beam/pull/10849#discussion_r378998869 ## File path: sdks/python/setup.py ## @@ -203,6 +203,7 @@ def get_version(): 'google-cloud-bigquery>=1.6.0,<1.18.0', 'google-cloud-core>=0.28.1,<2', 'google-cloud-bigtable>=0.31.1,<1.1.0', +'google-cloud-dlp >=0.12.0,<=0.13.0', Review comment: Version after 0.13 will not support py2. (notice at the top: https://googleapis.dev/python/dlp/latest/gapic/v2/api.html) I wonder if we need to add a comment note here for the person that will upgrade version ranges next? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 386758) Time Spent: 50m (was: 40m) > [Python] PTransform that connects to Cloud DLP deidentification service > --- > > Key: BEAM-9258 > URL: https://issues.apache.org/jira/browse/BEAM-9258 > Project: Beam > Issue Type: Sub-task > Components: io-py-gcp >Reporter: Michał Walenia >Assignee: Michał Walenia >Priority: Major > Time Spent: 50m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
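One way to leave the note the reviewer asks about is an inline comment next to the pin. The sketch below is illustrative only: the list name `GCP_REQUIREMENTS_SKETCH` is an assumption, and only the entries quoted in the diff are reproduced.

```python
# Hypothetical excerpt modelled on the quoted setup.py hunk.
GCP_REQUIREMENTS_SKETCH = [
    'google-cloud-bigquery>=1.6.0,<1.18.0',
    'google-cloud-core>=0.28.1,<2',
    'google-cloud-bigtable>=0.31.1,<1.1.0',
    # Per the review comment: versions after 0.13 will not support Python 2,
    # so do not raise this upper bound while Python 2 is still supported.
    'google-cloud-dlp>=0.12.0,<=0.13.0',
]

# Locate the pinned DLP requirement, e.g. for a sanity check in tooling.
dlp_pin = [r for r in GCP_REQUIREMENTS_SKETCH
           if r.startswith('google-cloud-dlp')][0]
```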
[jira] [Work logged] (BEAM-9258) [Python] PTransform that connects to Cloud DLP deidentification service
[ https://issues.apache.org/jira/browse/BEAM-9258?focusedWorklogId=386757&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-386757 ] ASF GitHub Bot logged work on BEAM-9258: Author: ASF GitHub Bot Created on: 13/Feb/20 17:10 Start Date: 13/Feb/20 17:10 Worklog Time Spent: 10m Work Description: aaltay commented on pull request #10849: [BEAM-9258] Integrate Google Cloud Data loss prevention functionality for Python SDK URL: https://github.com/apache/beam/pull/10849#discussion_r378990795 ## File path: sdks/python/apache_beam/ml/gcp/cloud_dlp.py ## @@ -0,0 +1,224 @@ +# /* +# * Licensed to the Apache Software Foundation (ASF) under one +# * or more contributor license agreements. See the NOTICE file +# * distributed with this work for additional information +# * regarding copyright ownership. The ASF licenses this file +# * to you under the Apache License, Version 2.0 (the +# * "License"); you may not use this file except in compliance +# * with the License. You may obtain a copy of the License at +# * +# * http://www.apache.org/licenses/LICENSE-2.0 +# * +# * Unless required by applicable law or agreed to in writing, software +# * distributed under the License is distributed on an "AS IS" BASIS, +# * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# * See the License for the specific language governing permissions and +# * limitations under the License. +# */ + +"""``PTransforms`` that implement Google Cloud Data Loss Prevention +functionality. +""" + +from __future__ import absolute_import + +import logging + +from google.cloud import dlp_v2 + +import apache_beam as beam +from apache_beam.utils import retry +from apache_beam.utils.annotations import experimental + +__all__ = ['MaskDetectedDetails', 'InspectForDetails'] + +_LOGGER = logging.getLogger(__name__) + + +@experimental() +class MaskDetectedDetails(beam.PTransform): + """Scrubs sensitive information detected in text. 
+ The ``PTransform`` returns a ``PCollection`` of ``str`` + Example usage:: +pipeline | MaskDetectedDetails(project='example-gcp-project', + deidentification_config={ + 'info_type_transformations: { + 'transformations': [{ + 'primitive_transformation': { + 'character_mask_config': { + 'masking_character': '#' + } + } + }] + } + }, inspection_config={'info_types': [{'name': 'EMAIL_ADDRESS'}]}) + """ + def __init__( + self, + project=None, + deidentification_template_name=None, + deidentification_config=None, + inspection_template_name=None, + inspection_config=None, + timeout=None): +"""Initializes a :class:`MaskDetectedDetails` transform. +Args: + project (str): Required. GCP project in which the data processing is +to be done + deidentification_template_name (str): Either this or +`deidentification_config` required. Name of +deidentification template to be used on detected sensitive information +instances in text. + deidentification_config +(``Union[dict, google.cloud.dlp_v2.types.DeidentifyConfig]``): +Configuration for the de-identification of the content item. + inspection_template_name (str): This or `inspection_config` required. +Name of inspection template to be used +to detect sensitive data in text. + inspection_config +(``Union[dict, google.cloud.dlp_v2.types.InspectConfig]``): +Configuration for the inspector used to detect sensitive data in text. + timeout (float): Optional. The amount of time, in seconds, to wait for +the request to complete. +""" +self.config = {} +self.project = project +self.timeout = timeout +if project is None: + raise ValueError( + 'GCP project name needs to be specified in "project" property') +if deidentification_template_name is not None \ +and deidentification_config is not None: + raise ValueError( + 'Both deidentification_template_name and ' + 'deidentification_config were specified.' 
+ ' Please specify only one of these.') +elif deidentification_template_name is None \ +and deidentification_config is None: + raise ValueError( + 'deidentification_template_name or ' + 'deidentification_config must be specified.') +elif deidentification_template_name is not None: + self.config['deidentify_template_name'] = deidentification_template_name +else: + self.config['deidentify_config'] = deidentification_config + +if inspection_template_name is not None and inspection_config is not None: + raise ValueError( + 'Both inspe
[jira] [Work logged] (BEAM-9258) [Python] PTransform that connects to Cloud DLP deidentification service
[ https://issues.apache.org/jira/browse/BEAM-9258?focusedWorklogId=386759&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-386759 ] ASF GitHub Bot logged work on BEAM-9258: Author: ASF GitHub Bot Created on: 13/Feb/20 17:10 Start Date: 13/Feb/20 17:10 Worklog Time Spent: 10m Work Description: aaltay commented on pull request #10849: [BEAM-9258] Integrate Google Cloud Data loss prevention functionality for Python SDK URL: https://github.com/apache/beam/pull/10849#discussion_r378995313 ## File path: sdks/python/apache_beam/ml/gcp/cloud_dlp.py ## @@ -0,0 +1,224 @@ +# /* +# * Licensed to the Apache Software Foundation (ASF) under one +# * or more contributor license agreements. See the NOTICE file +# * distributed with this work for additional information +# * regarding copyright ownership. The ASF licenses this file +# * to you under the Apache License, Version 2.0 (the +# * "License"); you may not use this file except in compliance +# * with the License. You may obtain a copy of the License at +# * +# * http://www.apache.org/licenses/LICENSE-2.0 +# * +# * Unless required by applicable law or agreed to in writing, software +# * distributed under the License is distributed on an "AS IS" BASIS, +# * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# * See the License for the specific language governing permissions and +# * limitations under the License. +# */ + +"""``PTransforms`` that implement Google Cloud Data Loss Prevention +functionality. +""" + +from __future__ import absolute_import + +import logging + +from google.cloud import dlp_v2 + +import apache_beam as beam +from apache_beam.utils import retry +from apache_beam.utils.annotations import experimental + +__all__ = ['MaskDetectedDetails', 'InspectForDetails'] + +_LOGGER = logging.getLogger(__name__) + + +@experimental() +class MaskDetectedDetails(beam.PTransform): + """Scrubs sensitive information detected in text. 
+ The ``PTransform`` returns a ``PCollection`` of ``str`` + Example usage:: +pipeline | MaskDetectedDetails(project='example-gcp-project', + deidentification_config={ + 'info_type_transformations: { + 'transformations': [{ + 'primitive_transformation': { + 'character_mask_config': { + 'masking_character': '#' + } + } + }] + } + }, inspection_config={'info_types': [{'name': 'EMAIL_ADDRESS'}]}) + """ + def __init__( + self, + project=None, + deidentification_template_name=None, + deidentification_config=None, + inspection_template_name=None, + inspection_config=None, + timeout=None): +"""Initializes a :class:`MaskDetectedDetails` transform. +Args: + project (str): Required. GCP project in which the data processing is +to be done + deidentification_template_name (str): Either this or +`deidentification_config` required. Name of +deidentification template to be used on detected sensitive information +instances in text. + deidentification_config +(``Union[dict, google.cloud.dlp_v2.types.DeidentifyConfig]``): +Configuration for the de-identification of the content item. + inspection_template_name (str): This or `inspection_config` required. +Name of inspection template to be used +to detect sensitive data in text. + inspection_config +(``Union[dict, google.cloud.dlp_v2.types.InspectConfig]``): +Configuration for the inspector used to detect sensitive data in text. + timeout (float): Optional. The amount of time, in seconds, to wait for +the request to complete. +""" +self.config = {} +self.project = project +self.timeout = timeout +if project is None: + raise ValueError( + 'GCP project name needs to be specified in "project" property') +if deidentification_template_name is not None \ +and deidentification_config is not None: + raise ValueError( + 'Both deidentification_template_name and ' + 'deidentification_config were specified.' 
+ ' Please specify only one of these.') +elif deidentification_template_name is None \ +and deidentification_config is None: + raise ValueError( + 'deidentification_template_name or ' + 'deidentification_config must be specified.') +elif deidentification_template_name is not None: + self.config['deidentify_template_name'] = deidentification_template_name +else: + self.config['deidentify_config'] = deidentification_config + +if inspection_template_name is not None and inspection_config is not None: + raise ValueError( + 'Both inspe
[jira] [Work logged] (BEAM-9258) [Python] PTransform that connects to Cloud DLP deidentification service
[ https://issues.apache.org/jira/browse/BEAM-9258?focusedWorklogId=386760&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-386760 ] ASF GitHub Bot logged work on BEAM-9258: Author: ASF GitHub Bot Created on: 13/Feb/20 17:10 Start Date: 13/Feb/20 17:10 Worklog Time Spent: 10m Work Description: aaltay commented on pull request #10849: [BEAM-9258] Integrate Google Cloud Data loss prevention functionality for Python SDK URL: https://github.com/apache/beam/pull/10849#discussion_r378994491 ## File path: sdks/python/apache_beam/ml/gcp/cloud_dlp.py ## @@ -0,0 +1,224 @@ +# /* +# * Licensed to the Apache Software Foundation (ASF) under one +# * or more contributor license agreements. See the NOTICE file +# * distributed with this work for additional information +# * regarding copyright ownership. The ASF licenses this file +# * to you under the Apache License, Version 2.0 (the +# * "License"); you may not use this file except in compliance +# * with the License. You may obtain a copy of the License at +# * +# * http://www.apache.org/licenses/LICENSE-2.0 +# * +# * Unless required by applicable law or agreed to in writing, software +# * distributed under the License is distributed on an "AS IS" BASIS, +# * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# * See the License for the specific language governing permissions and +# * limitations under the License. +# */ + +"""``PTransforms`` that implement Google Cloud Data Loss Prevention +functionality. +""" + +from __future__ import absolute_import + +import logging + +from google.cloud import dlp_v2 + +import apache_beam as beam +from apache_beam.utils import retry +from apache_beam.utils.annotations import experimental + +__all__ = ['MaskDetectedDetails', 'InspectForDetails'] + +_LOGGER = logging.getLogger(__name__) + + +@experimental() +class MaskDetectedDetails(beam.PTransform): + """Scrubs sensitive information detected in text. 
+ The ``PTransform`` returns a ``PCollection`` of ``str`` + Example usage:: +pipeline | MaskDetectedDetails(project='example-gcp-project', + deidentification_config={ + 'info_type_transformations: { + 'transformations': [{ + 'primitive_transformation': { + 'character_mask_config': { + 'masking_character': '#' + } + } + }] + } + }, inspection_config={'info_types': [{'name': 'EMAIL_ADDRESS'}]}) + """ + def __init__( + self, + project=None, + deidentification_template_name=None, + deidentification_config=None, + inspection_template_name=None, + inspection_config=None, + timeout=None): +"""Initializes a :class:`MaskDetectedDetails` transform. +Args: + project (str): Required. GCP project in which the data processing is +to be done + deidentification_template_name (str): Either this or +`deidentification_config` required. Name of +deidentification template to be used on detected sensitive information +instances in text. + deidentification_config +(``Union[dict, google.cloud.dlp_v2.types.DeidentifyConfig]``): +Configuration for the de-identification of the content item. + inspection_template_name (str): This or `inspection_config` required. +Name of inspection template to be used +to detect sensitive data in text. + inspection_config +(``Union[dict, google.cloud.dlp_v2.types.InspectConfig]``): +Configuration for the inspector used to detect sensitive data in text. + timeout (float): Optional. The amount of time, in seconds, to wait for +the request to complete. +""" +self.config = {} +self.project = project +self.timeout = timeout +if project is None: + raise ValueError( + 'GCP project name needs to be specified in "project" property') +if deidentification_template_name is not None \ +and deidentification_config is not None: + raise ValueError( + 'Both deidentification_template_name and ' + 'deidentification_config were specified.' 
+ ' Please specify only one of these.') +elif deidentification_template_name is None \ +and deidentification_config is None: + raise ValueError( + 'deidentification_template_name or ' + 'deidentification_config must be specified.') +elif deidentification_template_name is not None: + self.config['deidentify_template_name'] = deidentification_template_name +else: + self.config['deidentify_config'] = deidentification_config + +if inspection_template_name is not None and inspection_config is not None: + raise ValueError( + 'Both inspe
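The quoted constructor applies the same "exactly one of template name or inline config" rule twice, once for de-identification and once for inspection. A minimal sketch of that rule as a reusable helper is shown below; `pick_exactly_one` is a hypothetical name that does not appear in the pull request, and the error messages are modelled on the quoted ones rather than copied from the final code.

```python
def pick_exactly_one(template_name, config, template_key, config_key):
    """Return a one-entry dict for whichever of the two options was given."""
    if template_name is not None and config is not None:
        raise ValueError(
            'Both %s and %s were specified. Please specify only one of these.'
            % (template_key, config_key))
    if template_name is None and config is None:
        raise ValueError(
            '%s or %s must be specified.' % (template_key, config_key))
    if template_name is not None:
        return {template_key: template_name}
    return {config_key: config}


# Inline config taken from the docstring example in the diff. Note the closing
# quote on 'info_type_transformations', which the quoted docstring omits.
deidentification_config = {
    'info_type_transformations': {
        'transformations': [{
            'primitive_transformation': {
                'character_mask_config': {'masking_character': '#'}
            }
        }]
    }
}

config = {}
config.update(pick_exactly_one(
    None, deidentification_config,
    'deidentify_template_name', 'deidentify_config'))
config.update(pick_exactly_one(
    None, {'info_types': [{'name': 'EMAIL_ADDRESS'}]},
    'inspect_template_name', 'inspect_config'))
```

The key names `inspect_template_name` and `inspect_config` are assumed by analogy with the de-identification keys; the inspection branch is truncated in the quoted diff.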
[jira] [Work logged] (BEAM-9258) [Python] PTransform that connects to Cloud DLP deidentification service
[ https://issues.apache.org/jira/browse/BEAM-9258?focusedWorklogId=386545&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-386545 ] ASF GitHub Bot logged work on BEAM-9258: Author: ASF GitHub Bot Created on: 13/Feb/20 10:52 Start Date: 13/Feb/20 10:52 Worklog Time Spent: 10m Work Description: mwalenia commented on issue #10849: [BEAM-9258] Integrate Google Cloud Data loss prevention functionality for Python SDK URL: https://github.com/apache/beam/pull/10849#issuecomment-585666957 R: @aaltay Can you take a look at this? Thanks! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 386545) Time Spent: 20m (was: 10m) > [Python] PTransform that connects to Cloud DLP deidentification service > --- > > Key: BEAM-9258 > URL: https://issues.apache.org/jira/browse/BEAM-9258 > Project: Beam > Issue Type: Sub-task > Components: io-py-gcp >Reporter: Michał Walenia >Assignee: Michał Walenia >Priority: Major > Time Spent: 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (BEAM-9258) [Python] PTransform that connects to Cloud DLP deidentification service
[ https://issues.apache.org/jira/browse/BEAM-9258?focusedWorklogId=386544&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-386544 ] ASF GitHub Bot logged work on BEAM-9258: Author: ASF GitHub Bot Created on: 13/Feb/20 10:51 Start Date: 13/Feb/20 10:51 Worklog Time Spent: 10m Work Description: mwalenia commented on pull request #10849: [BEAM-9258] Integrate Google Cloud Data loss prevention functionality for Python SDK URL: https://github.com/apache/beam/pull/10849 Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily: - [x] [**Choose reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and mention them in a comment (`R: @username`). - [x] Format the pull request title like `[BEAM-XXX] Fixes bug in ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue. - [ ] Update `CHANGES.md` with noteworthy changes. - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf). See the [Contributor Guide](https://beam.apache.org/contribute) for more tips on [how to make review process smoother](https://beam.apache.org/contribute/#make-reviewers-job-easier). 