[jira] [Work logged] (BEAM-6683) Add an integration test suite for cross-language transforms for Flink runner

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-6683?focusedWorklogId=242213=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-242213
 ]

ASF GitHub Bot logged work on BEAM-6683:


Author: ASF GitHub Bot
Created on: 15/May/19 03:08
Start Date: 15/May/19 03:08
Worklog Time Spent: 10m 
  Work Description: ihji commented on issue #8174: [BEAM-6683] add 
createCrossLanguageValidatesRunner task
URL: https://github.com/apache/beam/pull/8174#issuecomment-492486294
 
 
   Run python precommit
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 242213)
Time Spent: 5h 40m  (was: 5.5h)

> Add an integration test suite for cross-language transforms for Flink runner
> 
>
> Key: BEAM-6683
> URL: https://issues.apache.org/jira/browse/BEAM-6683
> Project: Beam
>  Issue Type: Test
>  Components: testing
>Reporter: Chamikara Jayalath
>Assignee: Heejong Lee
>Priority: Major
>  Time Spent: 5h 40m
>  Remaining Estimate: 0h
>
> We should add an integration test suite that covers following.
> (1) Currently available Java IO connectors that do not use UDFs work for 
> Python SDK on Flink runner.
> (2) Currently available Python IO connectors that do not use UDFs work for 
> Java SDK on Flink runner.
> (3) Currently available Java/Python pipelines work in a scalable manner for 
> cross-language pipelines (for example, try 10GB, 100GB input for 
> textio/avroio for Java and Python). 
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-6683) Add an integration test suite for cross-language transforms for Flink runner

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-6683?focusedWorklogId=242199=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-242199
 ]

ASF GitHub Bot logged work on BEAM-6683:


Author: ASF GitHub Bot
Created on: 15/May/19 02:28
Start Date: 15/May/19 02:28
Worklog Time Spent: 10m 
  Work Description: ihji commented on issue #8174: [BEAM-6683] add 
createCrossLanguageValidatesRunner task
URL: https://github.com/apache/beam/pull/8174#issuecomment-492479614
 
 
   Run java precommit
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 242199)
Time Spent: 5.5h  (was: 5h 20m)

> Add an integration test suite for cross-language transforms for Flink runner
> 
>
> Key: BEAM-6683
> URL: https://issues.apache.org/jira/browse/BEAM-6683
> Project: Beam
>  Issue Type: Test
>  Components: testing
>Reporter: Chamikara Jayalath
>Assignee: Heejong Lee
>Priority: Major
>  Time Spent: 5.5h
>  Remaining Estimate: 0h
>
> We should add an integration test suite that covers following.
> (1) Currently available Java IO connectors that do not use UDFs work for 
> Python SDK on Flink runner.
> (2) Currently available Python IO connectors that do not use UDFs work for 
> Java SDK on Flink runner.
> (3) Currently available Java/Python pipelines work in a scalable manner for 
> cross-language pipelines (for example, try 10GB, 100GB input for 
> textio/avroio for Java and Python). 
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-6988) TypeHints Py3 Error: test_non_function (apache_beam.typehints.typed_pipeline_test.MainInputTest) Fails on Python 3.7+

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-6988?focusedWorklogId=242192=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-242192
 ]

ASF GitHub Bot logged work on BEAM-6988:


Author: ASF GitHub Bot
Created on: 15/May/19 02:18
Start Date: 15/May/19 02:18
Worklog Time Spent: 10m 
  Work Description: udim commented on issue #8530: [BEAM-6988] solved 
problem related to updates of the str object
URL: https://github.com/apache/beam/pull/8530#issuecomment-492477990
 
 
   In 
`apache_beam.typehints.typed_pipeline_test:MainInputTest.test_non_function`, 
`getcallargs_forhints` is called twice:
   
https://github.com/apache/beam/blob/ef4b2ef7e5fa2fb87e1491df82d2797947f51be9/sdks/python/apache_beam/transforms/ptransform.py#L761-L762
   
   The second time fails, since type_hints is equals to `((,), 
{})`, but inspect.getcallargs fails because it thinks str.strip has 2 required 
positional args (self and chars). I opened a bug for this 
(https://bugs.python.org/issue36920), but I believe we can use 
inspect.Signature instead in Python 3.
   
   The code generating the type hint is I believe:
   
https://github.com/apache/beam/blob/5c7ee600ba8560102f61622c894633955d208f87/sdks/python/apache_beam/typehints/decorators.py#L330-L331
   
   So this brings up some questions (cc: @robertwb)
   1. Are partial type hints okay?
   1. Is the type hints heuristic from decorators.py above useful? Should it 
stay?
   
   To summarize: the type hints generated for str.strip are partial and 
inspect.getcallargs is buggy and believes that str.strip has 2 required 
positional args instead of 1.
   I think that if we decide to not guess type hints for builtin functions in 
Python 3 (the code in decorators.py), then this will solve the issue.
   Another option is to replace inspect.getcallargs with 
inspect.signature(fn).bind(...), which seems to be less buggy, but only works 
for builtins in 3.7.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 242192)
Time Spent: 2h 50m  (was: 2h 40m)

> TypeHints Py3 Error: test_non_function 
> (apache_beam.typehints.typed_pipeline_test.MainInputTest) Fails on Python 3.7+
> -
>
> Key: BEAM-6988
> URL: https://issues.apache.org/jira/browse/BEAM-6988
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-py-core
>Reporter: Robbe
>Assignee: niklas Hansson
>Priority: Major
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> {noformat}
> Traceback (most recent call last):
>  File 
> "/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/sdks/python/test-suites/tox/py37/build/srcs/sdks/python/apache_beam/typehints/typed_pipeline_test.py",
>  line 53, in test_non_function
>  result = ['xa', 'bbx', 'xcx'] | beam.Map(str.strip, 'x')
>  File 
> "/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/sdks/python/test-suites/tox/py37/build/srcs/sdks/python/apache_beam/transforms/ptransform.py",
>  line 510, in _ror_
>  result = p.apply(self, pvalueish, label)
>  File 
> "/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/sdks/python/test-suites/tox/py37/build/srcs/sdks/python/apache_beam/pipeline.py",
>  line 514, in apply
>  transform.type_check_inputs(pvalueish)
>  File 
> "/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/sdks/python/test-suites/tox/py37/build/srcs/sdks/python/apache_beam/transforms/ptransform.py",
>  line 753, in type_check_inputs
>  hints = getcallargs_forhints(argspec_fn, *type_hints[0], **type_hints[1])
>  File 
> "/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/sdks/python/test-suites/tox/py37/build/srcs/sdks/python/apache_beam/typehints/decorators.py",
>  line 283, in getcallargs_forhints
>  raise TypeCheckError(e)
>  apache_beam.typehints.decorators.TypeCheckError: strip() missing 1 required 
> positional argument: 'chars'{noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-7174) Add transforms for modifying schemas

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7174?focusedWorklogId=242180=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-242180
 ]

ASF GitHub Bot logged work on BEAM-7174:


Author: ASF GitHub Bot
Created on: 15/May/19 01:17
Start Date: 15/May/19 01:17
Worklog Time Spent: 10m 
  Work Description: reuvenlax commented on pull request #8425: [BEAM-7174] 
Add schema modification transforms
URL: https://github.com/apache/beam/pull/8425
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 242180)
Time Spent: 4h 20m  (was: 4h 10m)

> Add transforms for modifying schemas
> 
>
> Key: BEAM-7174
> URL: https://issues.apache.org/jira/browse/BEAM-7174
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-java-core
>Reporter: Reuven Lax
>Assignee: Reuven Lax
>Priority: Major
>  Time Spent: 4h 20m
>  Remaining Estimate: 0h
>
> We need transforms to add fields, remove fields, and rename fields.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-6693) ApproximateUnique transform for Python SDK

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-6693?focusedWorklogId=242175=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-242175
 ]

ASF GitHub Bot logged work on BEAM-6693:


Author: ASF GitHub Bot
Created on: 15/May/19 01:09
Start Date: 15/May/19 01:09
Worklog Time Spent: 10m 
  Work Description: aaltay commented on issue #8535: [BEAM-6693] 
ApproximateUnique transform for Python SDK
URL: https://github.com/apache/beam/pull/8535#issuecomment-492465971
 
 
   @Hannah-Jiang could you also please update the PR description?
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 242175)
Time Spent: 4h 40m  (was: 4.5h)

> ApproximateUnique transform for Python SDK
> --
>
> Key: BEAM-6693
> URL: https://issues.apache.org/jira/browse/BEAM-6693
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-py-core
>Reporter: Ahmet Altay
>Assignee: Hannah Jiang
>Priority: Minor
>  Time Spent: 4h 40m
>  Remaining Estimate: 0h
>
> Add a PTransform for estimating the number of distinct elements in a 
> PCollection and the number of distinct values associated with each key in a 
> PCollection KVs.
> it should offer the same API as its Java counterpart: 
> https://github.com/apache/beam/blob/11a977b8b26eff2274d706541127c19dc93131a2/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ApproximateUnique.java



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work started] (BEAM-7311) merge internal commits to beam open source trunk to prepare for the security patch

2019-05-14 Thread Hai Lu (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on BEAM-7311 started by Hai Lu.

> merge internal commits to beam open source trunk to prepare for the security 
> patch
> --
>
> Key: BEAM-7311
> URL: https://issues.apache.org/jira/browse/BEAM-7311
> Project: Beam
>  Issue Type: Task
>  Components: runner-samza
>Reporter: Hai Lu
>Assignee: Hai Lu
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Merge the following commits:
>  * add portable pipeline option and use that for job server driver
>  * minor refactor in server driver to allow potential code reuse
>  * miscellaneous fix on samza runne
>  ** pipeline life cycle listent to add pipeline optino in onInit
>  ** portable runner to support samza metrics reporter
>  ** add timeout for pipeline cancelation



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (BEAM-7311) merge internal commits to beam open source trunk to prepare for the security patch

2019-05-14 Thread Hai Lu (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hai Lu updated BEAM-7311:
-
Status: Open  (was: Triage Needed)

> merge internal commits to beam open source trunk to prepare for the security 
> patch
> --
>
> Key: BEAM-7311
> URL: https://issues.apache.org/jira/browse/BEAM-7311
> Project: Beam
>  Issue Type: Task
>  Components: runner-samza
>Reporter: Hai Lu
>Assignee: Hai Lu
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Merge the following commits:
>  * add portable pipeline option and use that for job server driver
>  * minor refactor in server driver to allow potential code reuse
>  * miscellaneous fix on samza runne
>  ** pipeline life cycle listent to add pipeline optino in onInit
>  ** portable runner to support samza metrics reporter
>  ** add timeout for pipeline cancelation



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-6988) TypeHints Py3 Error: test_non_function (apache_beam.typehints.typed_pipeline_test.MainInputTest) Fails on Python 3.7+

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-6988?focusedWorklogId=242156=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-242156
 ]

ASF GitHub Bot logged work on BEAM-6988:


Author: ASF GitHub Bot
Created on: 15/May/19 00:33
Start Date: 15/May/19 00:33
Worklog Time Spent: 10m 
  Work Description: udim commented on issue #8530: [BEAM-6988] solved 
problem related to updates of the str object
URL: https://github.com/apache/beam/pull/8530#issuecomment-492459908
 
 
   getargspec and getfullargspec has not worked on builtin functions before 3.7.
   What we're getting in 3.5 and 3.6 is the result of:
   ```
   >>> typehints.decorators.getfullargspec(str.strip.__call__)
   FullArgSpec(args=['self'], varargs='args', varkw='kwargs', defaults=None, 
kwonlyargs=[], kwonlydefaults=None, annotations={})
   ```
   Note that this is a bogus argspec with `self`, `args`, and `kwargs`.
   
   Does anyone know what corner case these lines handle?
   
https://github.com/apache/beam/blob/5c7ee600ba8560102f61622c894633955d208f87/sdks/python/apache_beam/typehints/decorators.py#L131-L133
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 242156)
Time Spent: 2h 40m  (was: 2.5h)

> TypeHints Py3 Error: test_non_function 
> (apache_beam.typehints.typed_pipeline_test.MainInputTest) Fails on Python 3.7+
> -
>
> Key: BEAM-6988
> URL: https://issues.apache.org/jira/browse/BEAM-6988
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-py-core
>Reporter: Robbe
>Assignee: niklas Hansson
>Priority: Major
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> {noformat}
> Traceback (most recent call last):
>  File 
> "/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/sdks/python/test-suites/tox/py37/build/srcs/sdks/python/apache_beam/typehints/typed_pipeline_test.py",
>  line 53, in test_non_function
>  result = ['xa', 'bbx', 'xcx'] | beam.Map(str.strip, 'x')
>  File 
> "/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/sdks/python/test-suites/tox/py37/build/srcs/sdks/python/apache_beam/transforms/ptransform.py",
>  line 510, in _ror_
>  result = p.apply(self, pvalueish, label)
>  File 
> "/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/sdks/python/test-suites/tox/py37/build/srcs/sdks/python/apache_beam/pipeline.py",
>  line 514, in apply
>  transform.type_check_inputs(pvalueish)
>  File 
> "/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/sdks/python/test-suites/tox/py37/build/srcs/sdks/python/apache_beam/transforms/ptransform.py",
>  line 753, in type_check_inputs
>  hints = getcallargs_forhints(argspec_fn, *type_hints[0], **type_hints[1])
>  File 
> "/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/sdks/python/test-suites/tox/py37/build/srcs/sdks/python/apache_beam/typehints/decorators.py",
>  line 283, in getcallargs_forhints
>  raise TypeCheckError(e)
>  apache_beam.typehints.decorators.TypeCheckError: strip() missing 1 required 
> positional argument: 'chars'{noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-7302) Dependencies are broken in SNAPSHOT pom files

2019-05-14 Thread Michael Luckey (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-7302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16839886#comment-16839886
 ] 

Michael Luckey commented on BEAM-7302:
--

Unfortunately, as long as we use shadow plugin, we probably need to manually 
configure our deps. See 
https://imperceptiblethoughts.com/shadow/publishing/#publishing-with-maven-plugin

> Dependencies are broken in SNAPSHOT pom files
> -
>
> Key: BEAM-7302
> URL: https://issues.apache.org/jira/browse/BEAM-7302
> Project: Beam
>  Issue Type: Bug
>  Components: build-system
>Affects Versions: 2.14.0
>Reporter: Ismaël Mejía
>Assignee: Michael Luckey
>Priority: Major
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> The generated pom in the SNAPSHOTS repository points to dependencies that 
> don't have the correct name, for example [the beam-sdks-java-core 
> pom|https://repository.apache.org/content/repositories/snapshots/org/apache/beam/beam-sdks-java-core/2.14.0-SNAPSHOT/beam-sdks-java-core-2.14.0-20190514.072148-6.pom]
>  points to
> {code}
> 
>   beam.model
>   pipeline
>   2.14.0-SNAPSHOT
>   compile
> 
> {code}
> but such groupId and artifactId do not exist (and have not existed in the 
> past).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-7311) merge internal commits to beam open source trunk to prepare for the security patch

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7311?focusedWorklogId=242155=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-242155
 ]

ASF GitHub Bot logged work on BEAM-7311:


Author: ASF GitHub Bot
Created on: 15/May/19 00:14
Start Date: 15/May/19 00:14
Worklog Time Spent: 10m 
  Work Description: lhaiesp commented on issue #8582: [BEAM-7311] merge 
internal commits to beam open source trunk to prepare for the security patch
URL: https://github.com/apache/beam/pull/8582#issuecomment-492456227
 
 
   @xinyuiscool  for review
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 242155)
Time Spent: 20m  (was: 10m)

> merge internal commits to beam open source trunk to prepare for the security 
> patch
> --
>
> Key: BEAM-7311
> URL: https://issues.apache.org/jira/browse/BEAM-7311
> Project: Beam
>  Issue Type: Task
>  Components: runner-samza
>Reporter: Hai Lu
>Assignee: Hai Lu
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Merge the following commits:
>  * add portable pipeline option and use that for job server driver
>  * minor refactor in server driver to allow potential code reuse
>  * miscellaneous fix on samza runne
>  ** pipeline life cycle listent to add pipeline optino in onInit
>  ** portable runner to support samza metrics reporter
>  ** add timeout for pipeline cancelation



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-7311) merge internal commits to beam open source trunk to prepare for the security patch

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7311?focusedWorklogId=242154=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-242154
 ]

ASF GitHub Bot logged work on BEAM-7311:


Author: ASF GitHub Bot
Created on: 15/May/19 00:13
Start Date: 15/May/19 00:13
Worklog Time Spent: 10m 
  Work Description: lhaiesp commented on pull request #8582: [BEAM-7311] 
merge internal commits to beam open source trunk to prepare for the security 
patch
URL: https://github.com/apache/beam/pull/8582
 
 
   merge internal commits to beam open source trunk to prepare for the security 
patch.
   
   Merge the following commits:
   - add portable pipeline option and use that for job server driver
   - minor refactor in server driver to allow potential code reuse
   - miscellaneous fix on samza runne
 - pipeline life cycle listent to add pipeline optino in onInit
 - portable runner to support samza metrics reporter
 - add timeout for pipeline cancelation
   
   
   
   Thank you for your contribution! Follow this checklist to help us 
incorporate your contribution quickly and easily:
   
- [ ] [**Choose 
reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and 
mention them in a comment (`R: @username`).
- [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in 
ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA 
issue, if applicable. This will automatically link the pull request to the 
issue.
- [ ] If this contribution is large, please file an Apache [Individual 
Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
   
   Post-Commit Tests Status (on master branch)
   

   
   Lang | SDK | Apex | Dataflow | Flink | Gearpump | Samza | Spark
   --- | --- | --- | --- | --- | --- | --- | ---
   Go | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Go/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Go/lastCompletedBuild/)
 | --- | --- | --- | --- | --- | ---
   Java | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java/lastCompletedBuild/)
 | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Apex/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Apex/lastCompletedBuild/)
 | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow/lastCompletedBuild/)
 | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink/lastCompletedBuild/)[![Build
 
Status](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Flink_Batch/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Flink_Batch/lastCompletedBuild/)[![Build
 
Status](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Flink_Streaming/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Flink_Streaming/lastCompletedBuild/)
 | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Gearpump/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Gearpump/lastCompletedBuild/)
 | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Samza/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Samza/lastCompletedBuild/)
 | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Spark/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Spark/lastCompletedBuild/)
   Python | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Python_Verify/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Python_Verify/lastCompletedBuild/)[![Build
 
Status](https://builds.apache.org/job/beam_PostCommit_Python3_Verify/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Python3_Verify/lastCompletedBuild/)
 | --- | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Py_VR_Dataflow/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Py_VR_Dataflow/lastCompletedBuild/)
  [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Py_ValCont/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Py_ValCont/lastCompletedBuild/)
 | [![Build 

[jira] [Work logged] (BEAM-7103) Adding AvroGenericCoder for simple dict type cross-language data transfer

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7103?focusedWorklogId=242153=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-242153
 ]

ASF GitHub Bot logged work on BEAM-7103:


Author: ASF GitHub Bot
Created on: 15/May/19 00:11
Start Date: 15/May/19 00:11
Worklog Time Spent: 10m 
  Work Description: ihji commented on issue #8342: [BEAM-7103] Adding 
AvroCoderTranslator for cross-language data transfer
URL: https://github.com/apache/beam/pull/8342#issuecomment-492455698
 
 
   Figured out that we can just use AvroCoder instead of creating 
AvroGenericCoder in Java SDK.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 242153)
Time Spent: 2h 20m  (was: 2h 10m)

> Adding AvroGenericCoder for simple dict type cross-language data transfer
> -
>
> Key: BEAM-7103
> URL: https://issues.apache.org/jira/browse/BEAM-7103
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-java-core, sdk-py-core
>Reporter: Heejong Lee
>Assignee: Heejong Lee
>Priority: Major
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> Adding AvroGenericCoder for simple dict type cross-language data transfer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (BEAM-7311) merge internal commits to beam open source trunk to prepare for the security patch

2019-05-14 Thread Hai Lu (JIRA)
Hai Lu created BEAM-7311:


 Summary: merge internal commits to beam open source trunk to 
prepare for the security patch
 Key: BEAM-7311
 URL: https://issues.apache.org/jira/browse/BEAM-7311
 Project: Beam
  Issue Type: Task
  Components: runner-samza
Reporter: Hai Lu
Assignee: Hai Lu


Merge the following commits:
 * add portable pipeline option and use that for job server driver
 * minor refactor in server driver to allow potential code reuse
 * miscellaneous fix on samza runne
 ** pipeline life cycle listent to add pipeline optino in onInit
 ** portable runner to support samza metrics reporter
 ** add timeout for pipeline cancelation



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (BEAM-7310) Confluent Schema Registry support in KafkaIO

2019-05-14 Thread Yohei Onishi (JIRA)
Yohei Onishi created BEAM-7310:
--

 Summary: Confluent Schema Registry support in KafkaIO
 Key: BEAM-7310
 URL: https://issues.apache.org/jira/browse/BEAM-7310
 Project: Beam
  Issue Type: Improvement
  Components: io-java-kafka
Affects Versions: 2.12.0
Reporter: Yohei Onishi


Confluent Schema Registry is useful when we manage Avro Schema but  KafkaIO 
does not support Confluent Schema Registry as discussed here.
https://stackoverflow.com/questions/56035121/unable-to-connect-from-dataflow-job-to-schema-registry-when-schema-registry-requ
https://lists.apache.org/thread.html/7695fccddebd08733b80ae1e43b79b636b63cd5fe583a2bdeecda6c4@%3Cuser.beam.apache.org%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-7103) Adding AvroGenericCoder for simple dict type cross-language data transfer

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7103?focusedWorklogId=242151=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-242151
 ]

ASF GitHub Bot logged work on BEAM-7103:


Author: ASF GitHub Bot
Created on: 15/May/19 00:04
Start Date: 15/May/19 00:04
Worklog Time Spent: 10m 
  Work Description: ihji commented on pull request #8342: [BEAM-7103] 
Adding AvroGenericCoder for cross-language data transfer
URL: https://github.com/apache/beam/pull/8342#discussion_r284045574
 
 

 ##
 File path: 
sdks/java/core/src/main/java/org/apache/beam/sdk/coders/AvroGenericCoder.java
 ##
 @@ -0,0 +1,32 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.beam.sdk.coders;
+
+import org.apache.avro.Schema;
+import org.apache.avro.generic.GenericRecord;
+
+/** GenericRecord Avro Coder for simple dict type cross-language data 
transfer. */
+public class AvroGenericCoder extends AvroCoder {
+  protected AvroGenericCoder(Schema schema) {
 
 Review comment:
   I think we can follow this up in a separate issue.
   https://issues.apache.org/jira/browse/BEAM-7309
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 242151)
Time Spent: 2h 10m  (was: 2h)

> Adding AvroGenericCoder for simple dict type cross-language data transfer
> -
>
> Key: BEAM-7103
> URL: https://issues.apache.org/jira/browse/BEAM-7103
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-java-core, sdk-py-core
>Reporter: Heejong Lee
>Assignee: Heejong Lee
>Priority: Major
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> Adding AvroGenericCoder for simple dict type cross-language data transfer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-7174) Add transforms for modifying schemas

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7174?focusedWorklogId=242152=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-242152
 ]

ASF GitHub Bot logged work on BEAM-7174:


Author: ASF GitHub Bot
Created on: 15/May/19 00:04
Start Date: 15/May/19 00:04
Worklog Time Spent: 10m 
  Work Description: robinyqiu commented on issue #8425: [BEAM-7174] Add 
schema modification transforms
URL: https://github.com/apache/beam/pull/8425#issuecomment-492454565
 
 
   LGTM. Thanks for the fix.
   
   [BEAM-7301](https://issues.apache.org/jira/browse/BEAM-7301) is tracking the 
remaining field ordering issue, and I am happy to work on the fix for that once 
I can find some time.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 242152)
Time Spent: 4h 10m  (was: 4h)

> Add transforms for modifying schemas
> 
>
> Key: BEAM-7174
> URL: https://issues.apache.org/jira/browse/BEAM-7174
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-java-core
>Reporter: Reuven Lax
>Assignee: Reuven Lax
>Priority: Major
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> We need transforms to add fields, remove fields, and rename fields.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (BEAM-7309) Adding AvroCoder to StandardCoders

2019-05-14 Thread Heejong Lee (JIRA)
Heejong Lee created BEAM-7309:
-

 Summary: Adding AvroCoder to StandardCoders
 Key: BEAM-7309
 URL: https://issues.apache.org/jira/browse/BEAM-7309
 Project: Beam
  Issue Type: Improvement
  Components: sdk-java-core, sdk-py-core
Reporter: Heejong Lee


Should AvroCoder be a standard coder? It could be useful as a common codec for 
a simple dict type element.

https://github.com/apache/beam/pull/8342#discussion_r284025317



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-7103) Adding AvroGenericCoder for simple dict type cross-language data transfer

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7103?focusedWorklogId=242149=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-242149
 ]

ASF GitHub Bot logged work on BEAM-7103:


Author: ASF GitHub Bot
Created on: 14/May/19 23:58
Start Date: 14/May/19 23:58
Worklog Time Spent: 10m 
  Work Description: ihji commented on pull request #8342: [BEAM-7103] 
Adding AvroGenericCoder for cross-language data transfer
URL: https://github.com/apache/beam/pull/8342#discussion_r284044634
 
 

 ##
 File path: sdks/python/apache_beam/coders/avro_generic_coder.py
 ##
 @@ -0,0 +1,99 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""avro generic record coder for simple dict-type x-language data transfer."""
+
+from __future__ import absolute_import
+
+import json
+from io import BytesIO
+
+from fastavro import parse_schema
+from fastavro import schemaless_reader
+from fastavro import schemaless_writer
+
+from apache_beam.coders.coder_impl import SimpleCoderImpl
+from apache_beam.coders.coders import Coder
+from apache_beam.coders.coders import FastCoder
+
+AVRO_GENERIC_CODER_URN = "beam:coder:avro:generic:v1"
+
+__all__ = ['AvroGenericCoder', 'AvroGenericRecord']
+
+
+class AvroGenericCoder(FastCoder):
+  """A coder used for AvroGenericRecord values."""
+
+  def __init__(self, schema):
+self.schema = schema
+
+  def _create_impl(self):
+return AvroGenericCoderImpl(self.schema)
+
+  def is_deterministic(self):
+# TODO: need to confirm if it's deterministic
+return False
 
 Review comment:
   https://issues.apache.org/jira/browse/BEAM-7308
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 242149)
Time Spent: 2h  (was: 1h 50m)

> Adding AvroGenericCoder for simple dict type cross-language data transfer
> -
>
> Key: BEAM-7103
> URL: https://issues.apache.org/jira/browse/BEAM-7103
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-java-core, sdk-py-core
>Reporter: Heejong Lee
>Assignee: Heejong Lee
>Priority: Major
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> Adding AvroGenericCoder for simple dict type cross-language data transfer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (BEAM-7308) Add determinism checker for Python AvroCoder

2019-05-14 Thread Heejong Lee (JIRA)
Heejong Lee created BEAM-7308:
-

 Summary: Add determinism checker for Python AvroCoder
 Key: BEAM-7308
 URL: https://issues.apache.org/jira/browse/BEAM-7308
 Project: Beam
  Issue Type: Improvement
  Components: sdk-py-core
Reporter: Heejong Lee


Add determinism checker for Python AvroCoder



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-7301) Beam transforms reorder fields

2019-05-14 Thread Yueyang Qiu (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-7301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16839875#comment-16839875
 ] 

Yueyang Qiu commented on BEAM-7301:
---

Context: [https://github.com/apache/beam/pull/8425]

> Beam transforms reorder fields
> --
>
> Key: BEAM-7301
> URL: https://issues.apache.org/jira/browse/BEAM-7301
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-java-core
>Reporter: Reuven Lax
>Assignee: Yueyang Qiu
>Priority: Major
>
> Currently transforms such as Select, DropFields, RenameFields, and AddFields 
> can create schemas with unexpected order. The problem is that 
> FieldAccessDescriptor stores top-level fields and nested fields separately, 
> so there's no way to tell the relative order between them. To fix this we 
> should refactor FieldAccessDescriptor: instead of storing these separately it 
> should store a single list, where each item in the list might optionally have 
> a nested FieldAccessDescriptor.
> Expected behavior from the transforms:
>    DropFields: preserves order in original schema
>    RenameFields: preserves order in original schema
>    AddFields: adds fields in order specified. If multiple nested fields are 
> selected, the first reference to the top field wins (e.g. adding "a.b", "c", 
> "a.d" results in adding a before c.
>   Select: Select fields in the order specified.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-7043) Add DynamoDBIO

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7043?focusedWorklogId=242138=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-242138
 ]

ASF GitHub Bot logged work on BEAM-7043:


Author: ASF GitHub Bot
Created on: 14/May/19 23:26
Start Date: 14/May/19 23:26
Worklog Time Spent: 10m 
  Work Description: cmachgodaddy commented on pull request #8390:  
[BEAM-7043] Add DynamoDBIO
URL: https://github.com/apache/beam/pull/8390#discussion_r284038159
 
 

 ##
 File path: 
sdks/java/io/amazon-web-services/src/main/java/org/apache/beam/sdk/io/aws/dynamodb/AttributeValueCoder.java
 ##
 @@ -0,0 +1,166 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.beam.sdk.io.aws.dynamodb;
+
+import com.amazonaws.services.dynamodbv2.model.AttributeValue;
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.OutputStream;
+import java.nio.ByteBuffer;
+import java.util.List;
+import java.util.stream.Collectors;
+import org.apache.beam.sdk.coders.AtomicCoder;
+import org.apache.beam.sdk.coders.BooleanCoder;
+import org.apache.beam.sdk.coders.ByteArrayCoder;
+import org.apache.beam.sdk.coders.Coder;
+import org.apache.beam.sdk.coders.CoderException;
+import org.apache.beam.sdk.coders.ListCoder;
+import org.apache.beam.sdk.coders.MapCoder;
+import org.apache.beam.sdk.coders.StringUtf8Coder;
+
+/** A {@link Coder} that serializes and deserializes the {@link 
AttributeValue} objects. */
+public class AttributeValueCoder extends AtomicCoder {
+
+  /** Data type of each value type in AttributeValue object. */
+  private enum AttributeValueType {
+s, // for String
+n, // for Number
+b, // for Byte
+sS, // for List of String
+nS, // for List of Number
+bS, // for List of Byte
+m, // for Map of String and AttributeValue
+l, // for list of AttributeValued
 
 Review comment:
   Good catch @lukecwik . And thank for your review!
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 242138)
Time Spent: 7.5h  (was: 7h 20m)

> Add DynamoDBIO
> --
>
> Key: BEAM-7043
> URL: https://issues.apache.org/jira/browse/BEAM-7043
> Project: Beam
>  Issue Type: New Feature
>  Components: io-java-aws
>Reporter: Cam Mach
>Assignee: Cam Mach
>Priority: Minor
>  Time Spent: 7.5h
>  Remaining Estimate: 0h
>
> Currently we don't have any feature to write data to AWS DynamoDB. This 
> feature will enable us to send data to DynamoDB



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-6959) Run Go SDK Post Commit tests against the Flink Runner.

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-6959?focusedWorklogId=242123=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-242123
 ]

ASF GitHub Bot logged work on BEAM-6959:


Author: ASF GitHub Bot
Created on: 14/May/19 23:17
Start Date: 14/May/19 23:17
Worklog Time Spent: 10m 
  Work Description: ibzib commented on issue #8531: [BEAM-6959] Add Flink 
tests for Go SDK
URL: https://github.com/apache/beam/pull/8531#issuecomment-492445364
 
 
   Run Go PostCommit
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 242123)
Time Spent: 1h 40m  (was: 1.5h)

> Run Go SDK  Post Commit tests against the Flink Runner.
> ---
>
> Key: BEAM-6959
> URL: https://issues.apache.org/jira/browse/BEAM-6959
> Project: Beam
>  Issue Type: Sub-task
>  Components: runner-flink, sdk-go, testing
>Reporter: Robert Burke
>Assignee: Kyle Weaver
>Priority: Minor
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> See parent task BEAM-6958



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-6959) Run Go SDK Post Commit tests against the Flink Runner.

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-6959?focusedWorklogId=242122=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-242122
 ]

ASF GitHub Bot logged work on BEAM-6959:


Author: ASF GitHub Bot
Created on: 14/May/19 23:16
Start Date: 14/May/19 23:16
Worklog Time Spent: 10m 
  Work Description: ibzib commented on pull request #8531: [BEAM-6959] Add 
Flink tests for Go SDK
URL: https://github.com/apache/beam/pull/8531#discussion_r284036025
 
 

 ##
 File path: sdks/go/test/run_integration_tests.sh
 ##
 @@ -71,21 +130,42 @@ docker images | grep $TAG
 # Push the container
 gcloud docker -- push $CONTAINER
 
-DATAFLOW_WORKER_JAR=$(find 
./runners/google-cloud-dataflow-java/worker/build/libs/beam-runners-google-cloud-dataflow-java-fn-api-worker-*.jar)
-echo "Using Dataflow worker jar: $DATAFLOW_WORKER_JAR"
+if [[ "$RUNNER" == "dataflow" ]]; then
+  if [[ -z "$DATAFLOW_WORKER_JAR" ]]; then
+DATAFLOW_WORKER_JAR=$(find 
./runners/google-cloud-dataflow-java/worker/build/libs/beam-runners-google-cloud-dataflow-java-fn-api-worker-*.jar)
+  fi
+  echo "Using Dataflow worker jar: $DATAFLOW_WORKER_JAR"
+elif [[ "$RUNNER" == "flink" ]]; then
+  if [[ -z "$ENDPOINT" ]]; then
+JOB_PORT=$(python ./sdks/python/apache_beam/utils/ports.py)
 
 Review comment:
   That was getting too messy, so I opted to just include a simple Python 
script inline in the interest of not overcomplicating things.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 242122)
Time Spent: 1.5h  (was: 1h 20m)

> Run Go SDK  Post Commit tests against the Flink Runner.
> ---
>
> Key: BEAM-6959
> URL: https://issues.apache.org/jira/browse/BEAM-6959
> Project: Beam
>  Issue Type: Sub-task
>  Components: runner-flink, sdk-go, testing
>Reporter: Robert Burke
>Assignee: Kyle Weaver
>Priority: Minor
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> See parent task BEAM-6958



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-6908) Add Python3 performance benchmarks

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-6908?focusedWorklogId=242118=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-242118
 ]

ASF GitHub Bot logged work on BEAM-6908:


Author: ASF GitHub Bot
Created on: 14/May/19 23:08
Start Date: 14/May/19 23:08
Worklog Time Spent: 10m 
  Work Description: tvalentyn commented on pull request #8518: [BEAM-6908] 
Refactor Python performance test groovy file for easy configuration
URL: https://github.com/apache/beam/pull/8518#discussion_r284034132
 
 

 ##
 File path: .test-infra/jenkins/job_PerformanceTests_Python.groovy
 ##
 @@ -18,46 +18,107 @@
 
 import CommonJobProperties as commonJobProperties
 
-// This job runs the Beam Python performance tests on PerfKit Benchmarker.
-job('beam_PerformanceTests_Python'){
-  // Set default Beam job properties.
-  commonJobProperties.setTopLevelMainJobProperties(delegate)
-
-  // Run job in postcommit every 6 hours, don't trigger every push.
-  commonJobProperties.setAutoJob(
-  delegate,
-  'H */6 * * *')
-
-  // Allows triggering this build against pull requests.
-  commonJobProperties.enablePhraseTriggeringFromPullRequest(
-  delegate,
-  'Python SDK Performance Test',
-  'Run Python Performance Test')
-
-  def pipelineArgs = [
-  project: 'apache-beam-testing',
-  staging_location: 'gs://temp-storage-for-end-to-end-tests/staging-it',
-  temp_location: 'gs://temp-storage-for-end-to-end-tests/temp-it',
-  output: 'gs://temp-storage-for-end-to-end-tests/py-it-cloud/output'
-  ]
-  def pipelineArgList = []
-  pipelineArgs.each({
-key, value -> pipelineArgList.add("--$key=$value")
-  })
-  def pipelineArgsJoined = pipelineArgList.join(',')
-
-  def argMap = [
-  beam_sdk : 'python',
-  benchmarks   : 'beam_integration_benchmark',
-  bigquery_table   : 'beam_performance.wordcount_py_pkb_results',
-  beam_it_class: 
'apache_beam.examples.wordcount_it_test:WordCountIT.test_wordcount_it',
-  beam_it_module   : 'sdks/python',
-  beam_prebuilt: 'true',  // skip beam prebuild
-  beam_python_sdk_location : 'build/apache-beam.tar.gz',
-  beam_runner  : 'TestDataflowRunner',
-  beam_it_timeout  : '1200',
-  beam_it_args : pipelineArgsJoined,
-  ]
-
-  commonJobProperties.buildPerformanceTest(delegate, argMap)
+
+class PerformanceTestConfigurations {
+  String jobName
+  String jobDescription
+  String jobTriggerPhrase
+  String buildSchedule = 'H */6 * * *'  // every 6 hours
+  String benchmarkName = 'beam_integration_benchmark'
+  String sdk = 'python'
+  String bigqueryTable
+  String itClass
+  String itModule
 
 Review comment:
   Per offline discussion, let's add a comment here:
   Gradle project that defines 'runIntegrationTest' task. This task is executed 
by Perfkit Beam benchmark launcher. 
   This task can be added by enablePythonPerformanceTest() defined in 
BeamModulePlugin. 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 242118)
Time Spent: 12h 40m  (was: 12.5h)

> Add Python3 performance benchmarks
> --
>
> Key: BEAM-6908
> URL: https://issues.apache.org/jira/browse/BEAM-6908
> Project: Beam
>  Issue Type: Sub-task
>  Components: testing
>Reporter: Mark Liu
>Assignee: Mark Liu
>Priority: Major
>  Time Spent: 12h 40m
>  Remaining Estimate: 0h
>
> Similar to 
> [beam_PerformanceTests_Python|https://builds.apache.org/view/A-D/view/Beam/view/PerformanceTests/job/beam_PerformanceTests_Python/],
>  we want to have a Python3 benchmark running on Jenkins to detect performance 
> regression during code adoption.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-4046) Decouple gradle project names and maven artifact ids

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4046?focusedWorklogId=242102=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-242102
 ]

ASF GitHub Bot logged work on BEAM-4046:


Author: ASF GitHub Bot
Created on: 14/May/19 22:49
Start Date: 14/May/19 22:49
Worklog Time Spent: 10m 
  Work Description: adude3141 commented on issue #8581: [BEAM-4046] revert 
test infra to be compatible with pending PRs
URL: https://github.com/apache/beam/pull/8581#issuecomment-492439215
 
 
   Thx, @angoenka for your help here!
   
   Created https://issues.apache.org/jira/browse/BEAM-7307
   
   And sorry for the inconvenience.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 242102)
Time Spent: 30h 40m  (was: 30.5h)

> Decouple gradle project names and maven artifact ids
> 
>
> Key: BEAM-4046
> URL: https://issues.apache.org/jira/browse/BEAM-4046
> Project: Beam
>  Issue Type: Sub-task
>  Components: build-system
>Reporter: Kenneth Knowles
>Assignee: Michael Luckey
>Priority: Major
>  Time Spent: 30h 40m
>  Remaining Estimate: 0h
>
> In our first draft, we had gradle projects like {{":beam-sdks-java-core"}}. 
> It is clumsy and requires a hacky settings.gradle that is not idiomatic.
> In our second draft, we changed them to names that work well with Gradle, 
> like {{":sdks:java:core"}}. This caused Maven artifact IDs to be wonky.
> In our third draft, we regressed to the first draft to get the Maven artifact 
> ids right.
> These should be able to be decoupled. It seems there are many StackOverflow 
> questions on the subject.
> Since it is unidiomatic and a poor user experience, if it does turn out to be 
> mandatory then it needs to be documented inline everywhere - the 
> settings.gradle should say why it is so bizarre, and each build.gradle should 
> indicate what its project id is.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (BEAM-7307) Revert test-infra to new project layout

2019-05-14 Thread Michael Luckey (JIRA)
Michael Luckey created BEAM-7307:


 Summary: Revert test-infra to new project layout
 Key: BEAM-7307
 URL: https://issues.apache.org/jira/browse/BEAM-7307
 Project: Beam
  Issue Type: Task
  Components: build-system
Reporter: Michael Luckey
Assignee: Michael Luckey


To enable release we needed to revert test-infra changes done in BEAM-4046.

After release is done, we should revert changes from 
https://github.com/apache/beam/pull/8581

Note: Also fix typo for samba validatesRunner, see 
https://github.com/apache/beam/pull/8581/files#diff-9591f0d06e82e711681fd77ed287578b



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (BEAM-7251) Testing BigQuery client fails queries if job results aren't immediately available

2019-05-14 Thread Udi Meiri (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udi Meiri resolved BEAM-7251.
-
   Resolution: Fixed
Fix Version/s: Not applicable

> Testing BigQuery client fails queries if job results aren't immediately 
> available
> -
>
> Key: BEAM-7251
> URL: https://issues.apache.org/jira/browse/BEAM-7251
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp
>Reporter: Udi Meiri
>Assignee: Udi Meiri
>Priority: Major
> Fix For: Not applicable
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Correction: the test client is using a synchronous query with a default 
> timeout of 10s: 
> https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs/query
> This matches the timestamps below (5:29:19 to 5:29:29).
> Also note that this this method only returns the first page of results.
> ---
> Adding functionality to fetch query results should solve this issue, which is 
> probably causing test flakiness.
> Log:
> May 05, 2019 5:29:19 PM org.apache.beam.runners.dataflow.TestDataflowRunner 
> checkForPAssertSuccess
> INFO: Success result for Dataflow job 
> 2019-05-05_17_25_26-4118012232925193147. Found 0 success, 0 failures out of 0 
> expected assertions.
> May 05, 2019 5:29:19 PM org.apache.beam.sdk.io.gcp.testing.BigqueryMatcher 
> matchesSafely
> INFO: Verifying Bigquery data
> May 05, 2019 5:29:29 PM 
> com.google.cloud.dataflow.testing.DataflowJUnitTestRunner main
> SEVERE: 
> testE2eBigQueryTornadoesWithStorageApi(org.apache.beam.examples.cookbook.BigQueryTornadoesIT)
> java.lang.AssertionError: 
> Expected: Expected checksum is (1ab4c7ec460b94bbb3c3885b178bf0e6bed56e1f)
>  but: The query job hasn't completed. Got response: 
> {"jobComplete":false,"jobReference":{"jobId":"job_cZkICLalRsrnivu78BX1y3UwMhIz","location":"US","projectId":"xxx"},"kind":"bigquery#queryResponse"}
>   at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
>   at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:8)
>   at 
> org.apache.beam.runners.dataflow.TestDataflowRunner.run(TestDataflowRunner.java:138)
>   at 
> org.apache.beam.runners.dataflow.TestDataflowRunner.run(TestDataflowRunner.java:90)
>   at 
> org.apache.beam.runners.dataflow.TestDataflowRunner.run(TestDataflowRunner.java:55)
>   at org.apache.beam.sdk.Pipeline.run(Pipeline.java:313)
>   at org.apache.beam.sdk.Pipeline.run(Pipeline.java:299)
>   at 
> org.apache.beam.examples.cookbook.BigQueryTornadoes.runBigQueryTornadoes(BigQueryTornadoes.java:199)
>   at 
> org.apache.beam.examples.cookbook.BigQueryTornadoesIT.runE2EBigQueryTornadoesTest(BigQueryTornadoesIT.java:70)
>   at 
> org.apache.beam.examples.cookbook.BigQueryTornadoesIT.testE2eBigQueryTornadoesWithStorageApi(BigQueryTornadoesIT.java:95)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
>   at org.junit.runner.JUnitCore.run(JUnitCore.java:115)
>   at 
> com.google.cloud.dataflow.testing.DataflowJUnitTestRunner.main(DataflowJUnitTestRunner.java:145)
> Exception in thread "main" java.lang.IllegalStateException: Tests failed, 
> check output logs for details.
>   at 
> 

[jira] [Work logged] (BEAM-7251) Testing BigQuery client fails queries if job results aren't immediately available

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7251?focusedWorklogId=242101=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-242101
 ]

ASF GitHub Bot logged work on BEAM-7251:


Author: ASF GitHub Bot
Created on: 14/May/19 22:41
Start Date: 14/May/19 22:41
Worklog Time Spent: 10m 
  Work Description: udim commented on pull request #8536: [BEAM-7251] 
Increase timeout for test BQ queries.
URL: https://github.com/apache/beam/pull/8536
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 242101)
Time Spent: 1h 40m  (was: 1.5h)

> Testing BigQuery client fails queries if job results aren't immediately 
> available
> -
>
> Key: BEAM-7251
> URL: https://issues.apache.org/jira/browse/BEAM-7251
> Project: Beam
>  Issue Type: Bug
>  Components: io-java-gcp
>Reporter: Udi Meiri
>Assignee: Udi Meiri
>Priority: Major
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Correction: the test client is using a synchronous query with a default 
> timeout of 10s: 
> https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs/query
> This matches the timestamps below (5:29:19 to 5:29:29).
> Also note that this this method only returns the first page of results.
> ---
> Adding functionality to fetch query results should solve this issue, which is 
> probably causing test flakiness.
> Log:
> May 05, 2019 5:29:19 PM org.apache.beam.runners.dataflow.TestDataflowRunner 
> checkForPAssertSuccess
> INFO: Success result for Dataflow job 
> 2019-05-05_17_25_26-4118012232925193147. Found 0 success, 0 failures out of 0 
> expected assertions.
> May 05, 2019 5:29:19 PM org.apache.beam.sdk.io.gcp.testing.BigqueryMatcher 
> matchesSafely
> INFO: Verifying Bigquery data
> May 05, 2019 5:29:29 PM 
> com.google.cloud.dataflow.testing.DataflowJUnitTestRunner main
> SEVERE: 
> testE2eBigQueryTornadoesWithStorageApi(org.apache.beam.examples.cookbook.BigQueryTornadoesIT)
> java.lang.AssertionError: 
> Expected: Expected checksum is (1ab4c7ec460b94bbb3c3885b178bf0e6bed56e1f)
>  but: The query job hasn't completed. Got response: 
> {"jobComplete":false,"jobReference":{"jobId":"job_cZkICLalRsrnivu78BX1y3UwMhIz","location":"US","projectId":"xxx"},"kind":"bigquery#queryResponse"}
>   at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
>   at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:8)
>   at 
> org.apache.beam.runners.dataflow.TestDataflowRunner.run(TestDataflowRunner.java:138)
>   at 
> org.apache.beam.runners.dataflow.TestDataflowRunner.run(TestDataflowRunner.java:90)
>   at 
> org.apache.beam.runners.dataflow.TestDataflowRunner.run(TestDataflowRunner.java:55)
>   at org.apache.beam.sdk.Pipeline.run(Pipeline.java:313)
>   at org.apache.beam.sdk.Pipeline.run(Pipeline.java:299)
>   at 
> org.apache.beam.examples.cookbook.BigQueryTornadoes.runBigQueryTornadoes(BigQueryTornadoes.java:199)
>   at 
> org.apache.beam.examples.cookbook.BigQueryTornadoesIT.runE2EBigQueryTornadoesTest(BigQueryTornadoesIT.java:70)
>   at 
> org.apache.beam.examples.cookbook.BigQueryTornadoesIT.testE2eBigQueryTornadoesWithStorageApi(BigQueryTornadoesIT.java:95)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at 

[jira] [Work logged] (BEAM-4046) Decouple gradle project names and maven artifact ids

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4046?focusedWorklogId=242100=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-242100
 ]

ASF GitHub Bot logged work on BEAM-4046:


Author: ASF GitHub Bot
Created on: 14/May/19 22:38
Start Date: 14/May/19 22:38
Worklog Time Spent: 10m 
  Work Description: angoenka commented on pull request #8581: [BEAM-4046] 
revert test infra to be compatible with pending PRs
URL: https://github.com/apache/beam/pull/8581
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 242100)
Time Spent: 30.5h  (was: 30h 20m)

> Decouple gradle project names and maven artifact ids
> 
>
> Key: BEAM-4046
> URL: https://issues.apache.org/jira/browse/BEAM-4046
> Project: Beam
>  Issue Type: Sub-task
>  Components: build-system
>Reporter: Kenneth Knowles
>Assignee: Michael Luckey
>Priority: Major
>  Time Spent: 30.5h
>  Remaining Estimate: 0h
>
> In our first draft, we had gradle projects like {{":beam-sdks-java-core"}}. 
> It is clumsy and requires a hacky settings.gradle that is not idiomatic.
> In our second draft, we changed them to names that work well with Gradle, 
> like {{":sdks:java:core"}}. This caused Maven artifact IDs to be wonky.
> In our third draft, we regressed to the first draft to get the Maven artifact 
> ids right.
> These should be able to be decoupled. It seems there are many StackOverflow 
> questions on the subject.
> Since it is unidiomatic and a poor user experience, if it does turn out to be 
> mandatory then it needs to be documented inline everywhere - the 
> settings.gradle should say why it is so bizarre, and each build.gradle should 
> indicate what its project id is.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-4046) Decouple gradle project names and maven artifact ids

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4046?focusedWorklogId=242099=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-242099
 ]

ASF GitHub Bot logged work on BEAM-4046:


Author: ASF GitHub Bot
Created on: 14/May/19 22:37
Start Date: 14/May/19 22:37
Worklog Time Spent: 10m 
  Work Description: angoenka commented on issue #8581: [BEAM-4046] revert 
test infra to be compatible with pending PRs
URL: https://github.com/apache/beam/pull/8581#issuecomment-492436567
 
 
   Agree, Can we add a jira for tracking.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 242099)
Time Spent: 30h 20m  (was: 30h 10m)

> Decouple gradle project names and maven artifact ids
> 
>
> Key: BEAM-4046
> URL: https://issues.apache.org/jira/browse/BEAM-4046
> Project: Beam
>  Issue Type: Sub-task
>  Components: build-system
>Reporter: Kenneth Knowles
>Assignee: Michael Luckey
>Priority: Major
>  Time Spent: 30h 20m
>  Remaining Estimate: 0h
>
> In our first draft, we had gradle projects like {{":beam-sdks-java-core"}}. 
> It is clumsy and requires a hacky settings.gradle that is not idiomatic.
> In our second draft, we changed them to names that work well with Gradle, 
> like {{":sdks:java:core"}}. This caused Maven artifact IDs to be wonky.
> In our third draft, we regressed to the first draft to get the Maven artifact 
> ids right.
> These should be able to be decoupled. It seems there are many StackOverflow 
> questions on the subject.
> Since it is unidiomatic and a poor user experience, if it does turn out to be 
> mandatory then it needs to be documented inline everywhere - the 
> settings.gradle should say why it is so bizarre, and each build.gradle should 
> indicate what its project id is.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-7275) ParDoLifecycleTest flaky on SparkValidatesRunner

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7275?focusedWorklogId=242097=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-242097
 ]

ASF GitHub Bot logged work on BEAM-7275:


Author: ASF GitHub Bot
Created on: 14/May/19 22:37
Start Date: 14/May/19 22:37
Worklog Time Spent: 10m 
  Work Description: adude3141 commented on issue #8563: [BEAM-7275] 
ParDoLifeCycleTest: collect lifecycle info for DoFn insta…
URL: https://github.com/apache/beam/pull/8563#issuecomment-492436408
 
 
   Run Dataflow ValidatesRunner
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 242097)
Time Spent: 3h 20m  (was: 3h 10m)

> ParDoLifecycleTest flaky on SparkValidatesRunner
> 
>
> Key: BEAM-7275
> URL: https://issues.apache.org/jira/browse/BEAM-7275
> Project: Beam
>  Issue Type: Bug
>  Components: test-failures
>Reporter: Michael Luckey
>Assignee: Michael Luckey
>Priority: Major
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> Fix of [BEAM-7197|https://issues.apache.org/jira/browse/BEAM-7197] seems to 
> have introduced some 
> [flakyness|https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Spark/buildTimeTrend]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-7275) ParDoLifecycleTest flaky on SparkValidatesRunner

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7275?focusedWorklogId=242098=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-242098
 ]

ASF GitHub Bot logged work on BEAM-7275:


Author: ASF GitHub Bot
Created on: 14/May/19 22:37
Start Date: 14/May/19 22:37
Worklog Time Spent: 10m 
  Work Description: adude3141 commented on issue #8563: [BEAM-7275] 
ParDoLifeCycleTest: collect lifecycle info for DoFn insta…
URL: https://github.com/apache/beam/pull/8563#issuecomment-492436468
 
 
   Run Flink ValidatesRunner
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 242098)
Time Spent: 3.5h  (was: 3h 20m)

> ParDoLifecycleTest flaky on SparkValidatesRunner
> 
>
> Key: BEAM-7275
> URL: https://issues.apache.org/jira/browse/BEAM-7275
> Project: Beam
>  Issue Type: Bug
>  Components: test-failures
>Reporter: Michael Luckey
>Assignee: Michael Luckey
>Priority: Major
>  Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> Fix of [BEAM-7197|https://issues.apache.org/jira/browse/BEAM-7197] seems to 
> have introduced some 
> [flakyness|https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Spark/buildTimeTrend]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-7275) ParDoLifecycleTest flaky on SparkValidatesRunner

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7275?focusedWorklogId=242095=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-242095
 ]

ASF GitHub Bot logged work on BEAM-7275:


Author: ASF GitHub Bot
Created on: 14/May/19 22:36
Start Date: 14/May/19 22:36
Worklog Time Spent: 10m 
  Work Description: adude3141 commented on issue #8563: [BEAM-7275] 
ParDoLifeCycleTest: collect lifecycle info for DoFn insta…
URL: https://github.com/apache/beam/pull/8563#issuecomment-492436251
 
 
   Run Dataflow PortabilityApi ValidatesRunner
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 242095)
Time Spent: 3h  (was: 2h 50m)

> ParDoLifecycleTest flaky on SparkValidatesRunner
> 
>
> Key: BEAM-7275
> URL: https://issues.apache.org/jira/browse/BEAM-7275
> Project: Beam
>  Issue Type: Bug
>  Components: test-failures
>Reporter: Michael Luckey
>Assignee: Michael Luckey
>Priority: Major
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> Fix of [BEAM-7197|https://issues.apache.org/jira/browse/BEAM-7197] seems to 
> have introduced some 
> [flakyness|https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Spark/buildTimeTrend]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-7275) ParDoLifecycleTest flaky on SparkValidatesRunner

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7275?focusedWorklogId=242094=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-242094
 ]

ASF GitHub Bot logged work on BEAM-7275:


Author: ASF GitHub Bot
Created on: 14/May/19 22:36
Start Date: 14/May/19 22:36
Worklog Time Spent: 10m 
  Work Description: adude3141 commented on issue #8563: [BEAM-7275] 
ParDoLifeCycleTest: collect lifecycle info for DoFn insta…
URL: https://github.com/apache/beam/pull/8563#issuecomment-492436171
 
 
   Run Spark ValidatesRunner
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 242094)
Time Spent: 2h 50m  (was: 2h 40m)

> ParDoLifecycleTest flaky on SparkValidatesRunner
> 
>
> Key: BEAM-7275
> URL: https://issues.apache.org/jira/browse/BEAM-7275
> Project: Beam
>  Issue Type: Bug
>  Components: test-failures
>Reporter: Michael Luckey
>Assignee: Michael Luckey
>Priority: Major
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> Fix of [BEAM-7197|https://issues.apache.org/jira/browse/BEAM-7197] seems to 
> have introduced some 
> [flakyness|https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Spark/buildTimeTrend]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-7275) ParDoLifecycleTest flaky on SparkValidatesRunner

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7275?focusedWorklogId=242096=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-242096
 ]

ASF GitHub Bot logged work on BEAM-7275:


Author: ASF GitHub Bot
Created on: 14/May/19 22:36
Start Date: 14/May/19 22:36
Worklog Time Spent: 10m 
  Work Description: adude3141 commented on issue #8563: [BEAM-7275] 
ParDoLifeCycleTest: collect lifecycle info for DoFn insta…
URL: https://github.com/apache/beam/pull/8563#issuecomment-492436349
 
 
   Run Java Flink PortableValidatesRunner Batch
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 242096)
Time Spent: 3h 10m  (was: 3h)

> ParDoLifecycleTest flaky on SparkValidatesRunner
> 
>
> Key: BEAM-7275
> URL: https://issues.apache.org/jira/browse/BEAM-7275
> Project: Beam
>  Issue Type: Bug
>  Components: test-failures
>Reporter: Michael Luckey
>Assignee: Michael Luckey
>Priority: Major
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> Fix of [BEAM-7197|https://issues.apache.org/jira/browse/BEAM-7197] seems to 
> have introduced some 
> [flakyness|https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Spark/buildTimeTrend]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-7275) ParDoLifecycleTest flaky on SparkValidatesRunner

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7275?focusedWorklogId=242091=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-242091
 ]

ASF GitHub Bot logged work on BEAM-7275:


Author: ASF GitHub Bot
Created on: 14/May/19 22:32
Start Date: 14/May/19 22:32
Worklog Time Spent: 10m 
  Work Description: adude3141 commented on issue #8563: [BEAM-7275] 
ParDoLifeCycleTest: collect lifecycle info for DoFn insta…
URL: https://github.com/apache/beam/pull/8563#issuecomment-492435029
 
 
   Run Spark ValidatesRunner
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 242091)
Time Spent: 2h 40m  (was: 2.5h)

> ParDoLifecycleTest flaky on SparkValidatesRunner
> 
>
> Key: BEAM-7275
> URL: https://issues.apache.org/jira/browse/BEAM-7275
> Project: Beam
>  Issue Type: Bug
>  Components: test-failures
>Reporter: Michael Luckey
>Assignee: Michael Luckey
>Priority: Major
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> Fix of [BEAM-7197|https://issues.apache.org/jira/browse/BEAM-7197] seems to 
> have introduced some 
> [flakyness|https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Spark/buildTimeTrend]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-6959) Run Go SDK Post Commit tests against the Flink Runner.

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-6959?focusedWorklogId=242092=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-242092
 ]

ASF GitHub Bot logged work on BEAM-6959:


Author: ASF GitHub Bot
Created on: 14/May/19 22:32
Start Date: 14/May/19 22:32
Worklog Time Spent: 10m 
  Work Description: ibzib commented on pull request #8531: [BEAM-6959] Add 
Flink tests for Go SDK
URL: https://github.com/apache/beam/pull/8531#discussion_r284025849
 
 

 ##
 File path: sdks/go/test/run_integration_tests.sh
 ##
 @@ -71,21 +130,42 @@ docker images | grep $TAG
 # Push the container
 gcloud docker -- push $CONTAINER
 
-DATAFLOW_WORKER_JAR=$(find 
./runners/google-cloud-dataflow-java/worker/build/libs/beam-runners-google-cloud-dataflow-java-fn-api-worker-*.jar)
-echo "Using Dataflow worker jar: $DATAFLOW_WORKER_JAR"
+if [[ "$RUNNER" == "dataflow" ]]; then
+  if [[ -z "$DATAFLOW_WORKER_JAR" ]]; then
+DATAFLOW_WORKER_JAR=$(find 
./runners/google-cloud-dataflow-java/worker/build/libs/beam-runners-google-cloud-dataflow-java-fn-api-worker-*.jar)
+  fi
+  echo "Using Dataflow worker jar: $DATAFLOW_WORKER_JAR"
+elif [[ "$RUNNER" == "flink" ]]; then
+  if [[ -z "$ENDPOINT" ]]; then
+JOB_PORT=$(python ./sdks/python/apache_beam/utils/ports.py)
 
 Review comment:
   I think I am going to implement the TODO [1] and have the job server process 
pipe its dynamically chosen port back to the test runner.
   
   [1] 
https://github.com/apache/beam/blob/7d6b7b89f5832d5d375059f9c1980181b443835b/sdks/python/apache_beam/runners/portability/portable_runner_test.py#L102-L103
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 242092)
Time Spent: 1h 20m  (was: 1h 10m)

> Run Go SDK  Post Commit tests against the Flink Runner.
> ---
>
> Key: BEAM-6959
> URL: https://issues.apache.org/jira/browse/BEAM-6959
> Project: Beam
>  Issue Type: Sub-task
>  Components: runner-flink, sdk-go, testing
>Reporter: Robert Burke
>Assignee: Kyle Weaver
>Priority: Minor
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> See parent task BEAM-6958



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-7275) ParDoLifecycleTest flaky on SparkValidatesRunner

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7275?focusedWorklogId=242090=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-242090
 ]

ASF GitHub Bot logged work on BEAM-7275:


Author: ASF GitHub Bot
Created on: 14/May/19 22:31
Start Date: 14/May/19 22:31
Worklog Time Spent: 10m 
  Work Description: adude3141 commented on issue #8563: [BEAM-7275] 
ParDoLifeCycleTest: collect lifecycle info for DoFn insta…
URL: https://github.com/apache/beam/pull/8563#issuecomment-492434857
 
 
   retest this please
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 242090)
Time Spent: 2.5h  (was: 2h 20m)

> ParDoLifecycleTest flaky on SparkValidatesRunner
> 
>
> Key: BEAM-7275
> URL: https://issues.apache.org/jira/browse/BEAM-7275
> Project: Beam
>  Issue Type: Bug
>  Components: test-failures
>Reporter: Michael Luckey
>Assignee: Michael Luckey
>Priority: Major
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> Fix of [BEAM-7197|https://issues.apache.org/jira/browse/BEAM-7197] seems to 
> have introduced some 
> [flakyness|https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Spark/buildTimeTrend]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-6138) Add User Metric Support to Java SDK

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-6138?focusedWorklogId=242087=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-242087
 ]

ASF GitHub Bot logged work on BEAM-6138:


Author: ASF GitHub Bot
Created on: 14/May/19 22:30
Start Date: 14/May/19 22:30
Worklog Time Spent: 10m 
  Work Description: ajamato commented on issue #8280: [BEAM-6138] Update 
java SDK to report user distribution tuple metrics over the FN API
URL: https://github.com/apache/beam/pull/8280#issuecomment-492434483
 
 
Run Python PreCommit
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 242087)
Time Spent: 14h  (was: 13h 50m)

> Add User Metric Support to Java SDK
> ---
>
> Key: BEAM-6138
> URL: https://issues.apache.org/jira/browse/BEAM-6138
> Project: Beam
>  Issue Type: New Feature
>  Components: java-fn-execution
>Reporter: Alex Amato
>Assignee: Alex Amato
>Priority: Major
> Fix For: 3.0.0
>
>  Time Spent: 14h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-6138) Add User Metric Support to Java SDK

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-6138?focusedWorklogId=242089=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-242089
 ]

ASF GitHub Bot logged work on BEAM-6138:


Author: ASF GitHub Bot
Created on: 14/May/19 22:30
Start Date: 14/May/19 22:30
Worklog Time Spent: 10m 
  Work Description: ajamato commented on issue #8280: [BEAM-6138] Update 
java SDK to report user distribution tuple metrics over the FN API
URL: https://github.com/apache/beam/pull/8280#issuecomment-492434634
 
 
   Run Python PreCommit
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 242089)
Time Spent: 14h 10m  (was: 14h)

> Add User Metric Support to Java SDK
> ---
>
> Key: BEAM-6138
> URL: https://issues.apache.org/jira/browse/BEAM-6138
> Project: Beam
>  Issue Type: New Feature
>  Components: java-fn-execution
>Reporter: Alex Amato
>Assignee: Alex Amato
>Priority: Major
> Fix For: 3.0.0
>
>  Time Spent: 14h 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-7103) Adding AvroGenericCoder for simple dict type cross-language data transfer

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7103?focusedWorklogId=242088=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-242088
 ]

ASF GitHub Bot logged work on BEAM-7103:


Author: ASF GitHub Bot
Created on: 14/May/19 22:30
Start Date: 14/May/19 22:30
Worklog Time Spent: 10m 
  Work Description: chamikaramj commented on pull request #8342: 
[BEAM-7103] Adding AvroGenericCoder for cross-language data transfer
URL: https://github.com/apache/beam/pull/8342#discussion_r284025317
 
 

 ##
 File path: 
sdks/java/core/src/main/java/org/apache/beam/sdk/coders/AvroGenericCoder.java
 ##
 @@ -0,0 +1,32 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.beam.sdk.coders;
+
+import org.apache.avro.Schema;
+import org.apache.avro.generic.GenericRecord;
+
+/** GenericRecord Avro Coder for simple dict type cross-language data 
transfer. */
+public class AvroGenericCoder extends AvroCoder {
+  protected AvroGenericCoder(Schema schema) {
 
 Review comment:
   Also, should we be adding this to standard coder definitions in [1] and [2].
   
   [1] 
https://github.com/apache/beam/blob/master/model/fn-execution/src/main/resources/org/apache/beam/model/fnexecution/v1/standard_coders.yaml
   [2] 
https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/beam_runner_api.proto#L557
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 242088)
Time Spent: 1h 50m  (was: 1h 40m)

> Adding AvroGenericCoder for simple dict type cross-language data transfer
> -
>
> Key: BEAM-7103
> URL: https://issues.apache.org/jira/browse/BEAM-7103
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-java-core, sdk-py-core
>Reporter: Heejong Lee
>Assignee: Heejong Lee
>Priority: Major
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Adding AvroGenericCoder for simple dict type cross-language data transfer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-7103) Adding AvroGenericCoder for simple dict type cross-language data transfer

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7103?focusedWorklogId=242085=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-242085
 ]

ASF GitHub Bot logged work on BEAM-7103:


Author: ASF GitHub Bot
Created on: 14/May/19 22:29
Start Date: 14/May/19 22:29
Worklog Time Spent: 10m 
  Work Description: chamikaramj commented on pull request #8342: 
[BEAM-7103] Adding AvroGenericCoder for cross-language data transfer
URL: https://github.com/apache/beam/pull/8342#discussion_r284016513
 
 

 ##
 File path: sdks/python/apache_beam/coders/avro_generic_coder.py
 ##
 @@ -0,0 +1,99 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""avro generic record coder for simple dict-type x-language data transfer."""
+
+from __future__ import absolute_import
+
+import json
+from io import BytesIO
+
+from fastavro import parse_schema
+from fastavro import schemaless_reader
+from fastavro import schemaless_writer
+
+from apache_beam.coders.coder_impl import SimpleCoderImpl
+from apache_beam.coders.coders import Coder
+from apache_beam.coders.coders import FastCoder
+
+AVRO_GENERIC_CODER_URN = "beam:coder:avro:generic:v1"
+
+__all__ = ['AvroGenericCoder', 'AvroGenericRecord']
+
+
+class AvroGenericCoder(FastCoder):
+  """A coder used for AvroGenericRecord values."""
+
+  def __init__(self, schema):
+self.schema = schema
+
+  def _create_impl(self):
+return AvroGenericCoderImpl(self.schema)
+
+  def is_deterministic(self):
+# TODO: need to confirm if it's deterministic
+return False
 
 Review comment:
   Also, create a JIRA to add a determinism checker to Python SDK.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 242085)
Time Spent: 1h 40m  (was: 1.5h)

> Adding AvroGenericCoder for simple dict type cross-language data transfer
> -
>
> Key: BEAM-7103
> URL: https://issues.apache.org/jira/browse/BEAM-7103
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-java-core, sdk-py-core
>Reporter: Heejong Lee
>Assignee: Heejong Lee
>Priority: Major
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Adding AvroGenericCoder for simple dict type cross-language data transfer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-7103) Adding AvroGenericCoder for simple dict type cross-language data transfer

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7103?focusedWorklogId=242086=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-242086
 ]

ASF GitHub Bot logged work on BEAM-7103:


Author: ASF GitHub Bot
Created on: 14/May/19 22:29
Start Date: 14/May/19 22:29
Worklog Time Spent: 10m 
  Work Description: chamikaramj commented on pull request #8342: 
[BEAM-7103] Adding AvroGenericCoder for cross-language data transfer
URL: https://github.com/apache/beam/pull/8342#discussion_r284015793
 
 

 ##
 File path: 
sdks/java/core/src/main/java/org/apache/beam/sdk/coders/AvroGenericCoder.java
 ##
 @@ -0,0 +1,32 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.beam.sdk.coders;
+
+import org.apache.avro.Schema;
+import org.apache.avro.generic.GenericRecord;
+
+/** GenericRecord Avro Coder for simple dict type cross-language data 
transfer. */
 
 Review comment:
   Prob. remove the reference for cross-language since this can be used within 
SDK as well.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 242086)

> Adding AvroGenericCoder for simple dict type cross-language data transfer
> -
>
> Key: BEAM-7103
> URL: https://issues.apache.org/jira/browse/BEAM-7103
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-java-core, sdk-py-core
>Reporter: Heejong Lee
>Assignee: Heejong Lee
>Priority: Major
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Adding AvroGenericCoder for simple dict type cross-language data transfer.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-4046) Decouple gradle project names and maven artifact ids

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4046?focusedWorklogId=242084=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-242084
 ]

ASF GitHub Bot logged work on BEAM-4046:


Author: ASF GitHub Bot
Created on: 14/May/19 22:27
Start Date: 14/May/19 22:27
Worklog Time Spent: 10m 
  Work Description: adude3141 commented on issue #8581: [BEAM-4046] revert 
test infra to be compatible with pending PRs
URL: https://github.com/apache/beam/pull/8581#issuecomment-492433780
 
 
   BTW, we should revert this revert right *after* release is done.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 242084)
Time Spent: 30h 10m  (was: 30h)

> Decouple gradle project names and maven artifact ids
> 
>
> Key: BEAM-4046
> URL: https://issues.apache.org/jira/browse/BEAM-4046
> Project: Beam
>  Issue Type: Sub-task
>  Components: build-system
>Reporter: Kenneth Knowles
>Assignee: Michael Luckey
>Priority: Major
>  Time Spent: 30h 10m
>  Remaining Estimate: 0h
>
> In our first draft, we had gradle projects like {{":beam-sdks-java-core"}}. 
> It is clumsy and requires a hacky settings.gradle that is not idiomatic.
> In our second draft, we changed them to names that work well with Gradle, 
> like {{":sdks:java:core"}}. This caused Maven artifact IDs to be wonky.
> In our third draft, we regressed to the first draft to get the Maven artifact 
> ids right.
> These should be able to be decoupled. It seems there are many StackOverflow 
> questions on the subject.
> Since it is unidiomatic and a poor user experience, if it does turn out to be 
> mandatory then it needs to be documented inline everywhere - the 
> settings.gradle should say why it is so bizarre, and each build.gradle should 
> indicate what its project id is.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-4046) Decouple gradle project names and maven artifact ids

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4046?focusedWorklogId=242082=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-242082
 ]

ASF GitHub Bot logged work on BEAM-4046:


Author: ASF GitHub Bot
Created on: 14/May/19 22:25
Start Date: 14/May/19 22:25
Worklog Time Spent: 10m 
  Work Description: angoenka commented on issue #8581: [BEAM-4046] revert 
test infra to be compatible with pending PRs
URL: https://github.com/apache/beam/pull/8581#issuecomment-492433081
 
 
   Run Samza ValidatesRunner
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 242082)
Time Spent: 30h  (was: 29h 50m)

> Decouple gradle project names and maven artifact ids
> 
>
> Key: BEAM-4046
> URL: https://issues.apache.org/jira/browse/BEAM-4046
> Project: Beam
>  Issue Type: Sub-task
>  Components: build-system
>Reporter: Kenneth Knowles
>Assignee: Michael Luckey
>Priority: Major
>  Time Spent: 30h
>  Remaining Estimate: 0h
>
> In our first draft, we had gradle projects like {{":beam-sdks-java-core"}}. 
> It is clumsy and requires a hacky settings.gradle that is not idiomatic.
> In our second draft, we changed them to names that work well with Gradle, 
> like {{":sdks:java:core"}}. This caused Maven artifact IDs to be wonky.
> In our third draft, we regressed to the first draft to get the Maven artifact 
> ids right.
> These should be able to be decoupled. It seems there are many StackOverflow 
> questions on the subject.
> Since it is unidiomatic and a poor user experience, if it does turn out to be 
> mandatory then it needs to be documented inline everywhere - the 
> settings.gradle should say why it is so bizarre, and each build.gradle should 
> indicate what its project id is.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-4046) Decouple gradle project names and maven artifact ids

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4046?focusedWorklogId=242080=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-242080
 ]

ASF GitHub Bot logged work on BEAM-4046:


Author: ASF GitHub Bot
Created on: 14/May/19 22:21
Start Date: 14/May/19 22:21
Worklog Time Spent: 10m 
  Work Description: adude3141 commented on issue #8581: [BEAM-4046] revert 
test infra to be compatible with pending PRs
URL: https://github.com/apache/beam/pull/8581#issuecomment-492432133
 
 
   Decided to split commit, as it requires syntax fixes for 
https://github.com/apache/beam/commit/7e3559bbebc0c62eaec4dd21ce065b3a4f1a63fc#diff-10cb71de0a98f02f99aa3869117ab873
   which we need to keep on reverting this revert.
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 242080)
Time Spent: 29h 40m  (was: 29.5h)

> Decouple gradle project names and maven artifact ids
> 
>
> Key: BEAM-4046
> URL: https://issues.apache.org/jira/browse/BEAM-4046
> Project: Beam
>  Issue Type: Sub-task
>  Components: build-system
>Reporter: Kenneth Knowles
>Assignee: Michael Luckey
>Priority: Major
>  Time Spent: 29h 40m
>  Remaining Estimate: 0h
>
> In our first draft, we had gradle projects like {{":beam-sdks-java-core"}}. 
> It is clumsy and requires a hacky settings.gradle that is not idiomatic.
> In our second draft, we changed them to names that work well with Gradle, 
> like {{":sdks:java:core"}}. This caused Maven artifact IDs to be wonky.
> In our third draft, we regressed to the first draft to get the Maven artifact 
> ids right.
> These should be able to be decoupled. It seems there are many StackOverflow 
> questions on the subject.
> Since it is unidiomatic and a poor user experience, if it does turn out to be 
> mandatory then it needs to be documented inline everywhere - the 
> settings.gradle should say why it is so bizarre, and each build.gradle should 
> indicate what its project id is.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-4046) Decouple gradle project names and maven artifact ids

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4046?focusedWorklogId=242081=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-242081
 ]

ASF GitHub Bot logged work on BEAM-4046:


Author: ASF GitHub Bot
Created on: 14/May/19 22:21
Start Date: 14/May/19 22:21
Worklog Time Spent: 10m 
  Work Description: adude3141 commented on issue #8581: [BEAM-4046] revert 
test infra to be compatible with pending PRs
URL: https://github.com/apache/beam/pull/8581#issuecomment-492432181
 
 
   Run Seed Job
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 242081)
Time Spent: 29h 50m  (was: 29h 40m)

> Decouple gradle project names and maven artifact ids
> 
>
> Key: BEAM-4046
> URL: https://issues.apache.org/jira/browse/BEAM-4046
> Project: Beam
>  Issue Type: Sub-task
>  Components: build-system
>Reporter: Kenneth Knowles
>Assignee: Michael Luckey
>Priority: Major
>  Time Spent: 29h 50m
>  Remaining Estimate: 0h
>
> In our first draft, we had gradle projects like {{":beam-sdks-java-core"}}. 
> It is clumsy and requires a hacky settings.gradle that is not idiomatic.
> In our second draft, we changed them to names that work well with Gradle, 
> like {{":sdks:java:core"}}. This caused Maven artifact IDs to be wonky.
> In our third draft, we regressed to the first draft to get the Maven artifact 
> ids right.
> These should be able to be decoupled. It seems there are many StackOverflow 
> questions on the subject.
> Since it is unidiomatic and a poor user experience, if it does turn out to be 
> mandatory then it needs to be documented inline everywhere - the 
> settings.gradle should say why it is so bizarre, and each build.gradle should 
> indicate what its project id is.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-7131) Spark portable runner appears to be repeating work (in TFX example)

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7131?focusedWorklogId=242078=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-242078
 ]

ASF GitHub Bot logged work on BEAM-7131:


Author: ASF GitHub Bot
Created on: 14/May/19 22:18
Start Date: 14/May/19 22:18
Worklog Time Spent: 10m 
  Work Description: ibzib commented on pull request #8558: [BEAM-7131] 
Spark: cache executable stage output to prevent re-computation
URL: https://github.com/apache/beam/pull/8558#discussion_r284022196
 
 

 ##
 File path: 
runners/spark/src/main/java/org/apache/beam/runners/spark/translation/SparkBatchPortablePipelineTranslator.java
 ##
 @@ -224,6 +224,11 @@ private static void translateImpulse(
 MetricsAccumulator.getInstance());
 JavaRDD staged = inputRdd.mapPartitions(function);
 
+// Prevent potentially expensive re-computation of executable stage
+if (outputs.size() > 1) {
+  staged.cache();
 
 Review comment:
   I went for caching all the intermediate RDDs (can be turned off by 
`setCacheDisabled(true)`). PTAL
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 242078)
Time Spent: 2h 40m  (was: 2.5h)

> Spark portable runner appears to be repeating work (in TFX example)
> ---
>
> Key: BEAM-7131
> URL: https://issues.apache.org/jira/browse/BEAM-7131
> Project: Beam
>  Issue Type: Bug
>  Components: runner-spark
>Reporter: Kyle Weaver
>Assignee: Kyle Weaver
>Priority: Major
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> I've been trying to run the TFX Chicago taxi example [1] on the Spark 
> portable runner. TFDV works fine, but the preprocess step 
> (preprocess_flink.sh [2]) fails with the following error:
> RuntimeError: AlreadyExistsError: file already exists [while running 
> 'WriteTransformFn/WriteTransformFn']
> Assets are being written multiple times to different temp directories, which 
> is okay, but the error occurs when they are copied to the same permanent 
> output directory. Specifically, the copy tree operation in transform_fn_io.py 
> [3] is run twice with the same output directory. The error doesn't occur when 
> that code is modified to allow overwriting existing files, but that's only a 
> shallow fix. While the TF transform should probably be made idempotent, this 
> is also an issue with the Spark runner, which shouldn't be repeating work 
> like this regularly (in the absence of a failure condition).
> [1] [https://github.com/tensorflow/tfx/tree/master/tfx/examples/chicago_taxi]
> [2] 
> [https://github.com/tensorflow/tfx/blob/master/tfx/examples/chicago_taxi/preprocess_flink.sh]
> [3] 
> [https://github.com/tensorflow/transform/blob/master/tensorflow_transform/beam/tft_beam_io/transform_fn_io.py#L33-L45]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-4046) Decouple gradle project names and maven artifact ids

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4046?focusedWorklogId=242077=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-242077
 ]

ASF GitHub Bot logged work on BEAM-4046:


Author: ASF GitHub Bot
Created on: 14/May/19 22:15
Start Date: 14/May/19 22:15
Worklog Time Spent: 10m 
  Work Description: adude3141 commented on issue #8581: [BEAM-4046] revert 
test infra to be compatible with pending PRs
URL: https://github.com/apache/beam/pull/8581#issuecomment-492430436
 
 
   Run Seed Job
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 242077)
Time Spent: 29.5h  (was: 29h 20m)

> Decouple gradle project names and maven artifact ids
> 
>
> Key: BEAM-4046
> URL: https://issues.apache.org/jira/browse/BEAM-4046
> Project: Beam
>  Issue Type: Sub-task
>  Components: build-system
>Reporter: Kenneth Knowles
>Assignee: Michael Luckey
>Priority: Major
>  Time Spent: 29.5h
>  Remaining Estimate: 0h
>
> In our first draft, we had gradle projects like {{":beam-sdks-java-core"}}. 
> It is clumsy and requires a hacky settings.gradle that is not idiomatic.
> In our second draft, we changed them to names that work well with Gradle, 
> like {{":sdks:java:core"}}. This caused Maven artifact IDs to be wonky.
> In our third draft, we regressed to the first draft to get the Maven artifact 
> ids right.
> These should be able to be decoupled. It seems there are many StackOverflow 
> questions on the subject.
> Since it is unidiomatic and a poor user experience, if it does turn out to be 
> mandatory then it needs to be documented inline everywhere - the 
> settings.gradle should say why it is so bizarre, and each build.gradle should 
> indicate what its project id is.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-4046) Decouple gradle project names and maven artifact ids

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4046?focusedWorklogId=242075=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-242075
 ]

ASF GitHub Bot logged work on BEAM-4046:


Author: ASF GitHub Bot
Created on: 14/May/19 22:03
Start Date: 14/May/19 22:03
Worklog Time Spent: 10m 
  Work Description: angoenka commented on issue #8581: [BEAM-4046] revert 
test infra to be compatible with pending PRs
URL: https://github.com/apache/beam/pull/8581#issuecomment-492426924
 
 
   Run Seed Job
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 242075)
Time Spent: 29h 20m  (was: 29h 10m)

> Decouple gradle project names and maven artifact ids
> 
>
> Key: BEAM-4046
> URL: https://issues.apache.org/jira/browse/BEAM-4046
> Project: Beam
>  Issue Type: Sub-task
>  Components: build-system
>Reporter: Kenneth Knowles
>Assignee: Michael Luckey
>Priority: Major
>  Time Spent: 29h 20m
>  Remaining Estimate: 0h
>
> In our first draft, we had gradle projects like {{":beam-sdks-java-core"}}. 
> It is clumsy and requires a hacky settings.gradle that is not idiomatic.
> In our second draft, we changed them to names that work well with Gradle, 
> like {{":sdks:java:core"}}. This caused Maven artifact IDs to be wonky.
> In our third draft, we regressed to the first draft to get the Maven artifact 
> ids right.
> These should be able to be decoupled. It seems there are many StackOverflow 
> questions on the subject.
> Since it is unidiomatic and a poor user experience, if it does turn out to be 
> mandatory then it needs to be documented inline everywhere - the 
> settings.gradle should say why it is so bizarre, and each build.gradle should 
> indicate what its project id is.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-4046) Decouple gradle project names and maven artifact ids

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4046?focusedWorklogId=242073=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-242073
 ]

ASF GitHub Bot logged work on BEAM-4046:


Author: ASF GitHub Bot
Created on: 14/May/19 21:46
Start Date: 14/May/19 21:46
Worklog Time Spent: 10m 
  Work Description: adude3141 commented on issue #8581: [BEAM-4046] revert 
test infra to be compatible with pending PRs
URL: https://github.com/apache/beam/pull/8581#issuecomment-492422053
 
 
   R: @angoenka 
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 242073)
Time Spent: 29h 10m  (was: 29h)

> Decouple gradle project names and maven artifact ids
> 
>
> Key: BEAM-4046
> URL: https://issues.apache.org/jira/browse/BEAM-4046
> Project: Beam
>  Issue Type: Sub-task
>  Components: build-system
>Reporter: Kenneth Knowles
>Assignee: Michael Luckey
>Priority: Major
>  Time Spent: 29h 10m
>  Remaining Estimate: 0h
>
> In our first draft, we had gradle projects like {{":beam-sdks-java-core"}}. 
> It is clumsy and requires a hacky settings.gradle that is not idiomatic.
> In our second draft, we changed them to names that work well with Gradle, 
> like {{":sdks:java:core"}}. This caused Maven artifact IDs to be wonky.
> In our third draft, we regressed to the first draft to get the Maven artifact 
> ids right.
> These should be able to be decoupled. It seems there are many StackOverflow 
> questions on the subject.
> Since it is unidiomatic and a poor user experience, if it does turn out to be 
> mandatory then it needs to be documented inline everywhere - the 
> settings.gradle should say why it is so bizarre, and each build.gradle should 
> indicate what its project id is.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-7043) Add DynamoDBIO

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7043?focusedWorklogId=242072=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-242072
 ]

ASF GitHub Bot logged work on BEAM-7043:


Author: ASF GitHub Bot
Created on: 14/May/19 21:45
Start Date: 14/May/19 21:45
Worklog Time Spent: 10m 
  Work Description: lukecwik commented on pull request #8390:  [BEAM-7043] 
Add DynamoDBIO
URL: https://github.com/apache/beam/pull/8390#discussion_r284013439
 
 

 ##
 File path: 
sdks/java/io/amazon-web-services/src/main/java/org/apache/beam/sdk/io/aws/dynamodb/AttributeValueCoder.java
 ##
 @@ -0,0 +1,166 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.beam.sdk.io.aws.dynamodb;
+
+import com.amazonaws.services.dynamodbv2.model.AttributeValue;
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.OutputStream;
+import java.nio.ByteBuffer;
+import java.util.List;
+import java.util.stream.Collectors;
+import org.apache.beam.sdk.coders.AtomicCoder;
+import org.apache.beam.sdk.coders.BooleanCoder;
+import org.apache.beam.sdk.coders.ByteArrayCoder;
+import org.apache.beam.sdk.coders.Coder;
+import org.apache.beam.sdk.coders.CoderException;
+import org.apache.beam.sdk.coders.ListCoder;
+import org.apache.beam.sdk.coders.MapCoder;
+import org.apache.beam.sdk.coders.StringUtf8Coder;
+
+/** A {@link Coder} that serializes and deserializes the {@link 
AttributeValue} objects. */
+public class AttributeValueCoder extends AtomicCoder {
+
+  /** Data type of each value type in AttributeValue object. */
+  private enum AttributeValueType {
+s, // for String
+n, // for Number
+b, // for Byte
+sS, // for List of String
+nS, // for List of Number
+bS, // for List of Byte
+m, // for Map of String and AttributeValue
+l, // for list of AttributeValued
 
 Review comment:
   ```suggestion
   l, // for list of AttributeValue
   ```
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 242072)
Time Spent: 7h 20m  (was: 7h 10m)

> Add DynamoDBIO
> --
>
> Key: BEAM-7043
> URL: https://issues.apache.org/jira/browse/BEAM-7043
> Project: Beam
>  Issue Type: New Feature
>  Components: io-java-aws
>Reporter: Cam Mach
>Assignee: Cam Mach
>Priority: Minor
>  Time Spent: 7h 20m
>  Remaining Estimate: 0h
>
> Currently we don't have any feature to write data to AWS DynamoDB. This 
> feature will enable us to send data to DynamoDB



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-4046) Decouple gradle project names and maven artifact ids

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4046?focusedWorklogId=242071=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-242071
 ]

ASF GitHub Bot logged work on BEAM-4046:


Author: ASF GitHub Bot
Created on: 14/May/19 21:44
Start Date: 14/May/19 21:44
Worklog Time Spent: 10m 
  Work Description: adude3141 commented on pull request #8581: [BEAM-4046] 
revert test infra to be compatible with pending PRs
URL: https://github.com/apache/beam/pull/8581
 
 
   **Please** add a meaningful description for your change here
   
   
   
   Thank you for your contribution! Follow this checklist to help us 
incorporate your contribution quickly and easily:
   
- [ ] [**Choose 
reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and 
mention them in a comment (`R: @username`).
- [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in 
ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA 
issue, if applicable. This will automatically link the pull request to the 
issue.
- [ ] If this contribution is large, please file an Apache [Individual 
Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
   
   Post-Commit Tests Status (on master branch)
   

   
   Lang | SDK | Apex | Dataflow | Flink | Gearpump | Samza | Spark
   --- | --- | --- | --- | --- | --- | --- | ---
   Go | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Go/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Go/lastCompletedBuild/)
 | --- | --- | --- | --- | --- | ---
   Java | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java/lastCompletedBuild/)
 | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Apex/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Apex/lastCompletedBuild/)
 | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow/lastCompletedBuild/)
 | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink/lastCompletedBuild/)[![Build
 
Status](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Flink_Batch/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Flink_Batch/lastCompletedBuild/)[![Build
 
Status](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Flink_Streaming/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Flink_Streaming/lastCompletedBuild/)
 | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Gearpump/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Gearpump/lastCompletedBuild/)
 | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Samza/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Samza/lastCompletedBuild/)
 | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Spark/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Spark/lastCompletedBuild/)
   Python | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Python_Verify/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Python_Verify/lastCompletedBuild/)[![Build
 
Status](https://builds.apache.org/job/beam_PostCommit_Python3_Verify/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Python3_Verify/lastCompletedBuild/)
 | --- | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Py_VR_Dataflow/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Py_VR_Dataflow/lastCompletedBuild/)
  [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Py_ValCont/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Py_ValCont/lastCompletedBuild/)
 | [![Build 
Status](https://builds.apache.org/job/beam_PreCommit_Python_PVR_Flink_Cron/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PreCommit_Python_PVR_Flink_Cron/lastCompletedBuild/)
 | --- | --- | ---
   
   Pre-Commit Tests Status (on master branch)
   

   
   --- |Java | Python | Go | Website
   --- | --- | --- | --- | ---
   Non-portable | 

[jira] [Work logged] (BEAM-7131) Spark portable runner appears to be repeating work (in TFX example)

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7131?focusedWorklogId=242052=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-242052
 ]

ASF GitHub Bot logged work on BEAM-7131:


Author: ASF GitHub Bot
Created on: 14/May/19 21:12
Start Date: 14/May/19 21:12
Worklog Time Spent: 10m 
  Work Description: angoenka commented on issue #8558: [BEAM-7131] Spark: 
cache executable stage output to prevent re-computation
URL: https://github.com/apache/beam/pull/8558#issuecomment-492411712
 
 
   Run Seed Job
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 242052)
Time Spent: 2h 20m  (was: 2h 10m)

> Spark portable runner appears to be repeating work (in TFX example)
> ---
>
> Key: BEAM-7131
> URL: https://issues.apache.org/jira/browse/BEAM-7131
> Project: Beam
>  Issue Type: Bug
>  Components: runner-spark
>Reporter: Kyle Weaver
>Assignee: Kyle Weaver
>Priority: Major
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> I've been trying to run the TFX Chicago taxi example [1] on the Spark 
> portable runner. TFDV works fine, but the preprocess step 
> (preprocess_flink.sh [2]) fails with the following error:
> RuntimeError: AlreadyExistsError: file already exists [while running 
> 'WriteTransformFn/WriteTransformFn']
> Assets are being written multiple times to different temp directories, which 
> is okay, but the error occurs when they are copied to the same permanent 
> output directory. Specifically, the copy tree operation in transform_fn_io.py 
> [3] is run twice with the same output directory. The error doesn't occur when 
> that code is modified to allow overwriting existing files, but that's only a 
> shallow fix. While the TF transform should probably be made idempotent, this 
> is also an issue with the Spark runner, which shouldn't be repeating work 
> like this regularly (in the absence of a failure condition).
> [1] [https://github.com/tensorflow/tfx/tree/master/tfx/examples/chicago_taxi]
> [2] 
> [https://github.com/tensorflow/tfx/blob/master/tfx/examples/chicago_taxi/preprocess_flink.sh]
> [3] 
> [https://github.com/tensorflow/transform/blob/master/tensorflow_transform/beam/tft_beam_io/transform_fn_io.py#L33-L45]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-7131) Spark portable runner appears to be repeating work (in TFX example)

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7131?focusedWorklogId=242049=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-242049
 ]

ASF GitHub Bot logged work on BEAM-7131:


Author: ASF GitHub Bot
Created on: 14/May/19 21:12
Start Date: 14/May/19 21:12
Worklog Time Spent: 10m 
  Work Description: angoenka commented on issue #8558: [BEAM-7131] Spark: 
cache executable stage output to prevent re-computation
URL: https://github.com/apache/beam/pull/8558#issuecomment-492411660
 
 
   Run Seed Job
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 242049)
Time Spent: 1h 50m  (was: 1h 40m)

> Spark portable runner appears to be repeating work (in TFX example)
> ---
>
> Key: BEAM-7131
> URL: https://issues.apache.org/jira/browse/BEAM-7131
> Project: Beam
>  Issue Type: Bug
>  Components: runner-spark
>Reporter: Kyle Weaver
>Assignee: Kyle Weaver
>Priority: Major
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> I've been trying to run the TFX Chicago taxi example [1] on the Spark 
> portable runner. TFDV works fine, but the preprocess step 
> (preprocess_flink.sh [2]) fails with the following error:
> RuntimeError: AlreadyExistsError: file already exists [while running 
> 'WriteTransformFn/WriteTransformFn']
> Assets are being written multiple times to different temp directories, which 
> is okay, but the error occurs when they are copied to the same permanent 
> output directory. Specifically, the copy tree operation in transform_fn_io.py 
> [3] is run twice with the same output directory. The error doesn't occur when 
> that code is modified to allow overwriting existing files, but that's only a 
> shallow fix. While the TF transform should probably be made idempotent, this 
> is also an issue with the Spark runner, which shouldn't be repeating work 
> like this regularly (in the absence of a failure condition).
> [1] [https://github.com/tensorflow/tfx/tree/master/tfx/examples/chicago_taxi]
> [2] 
> [https://github.com/tensorflow/tfx/blob/master/tfx/examples/chicago_taxi/preprocess_flink.sh]
> [3] 
> [https://github.com/tensorflow/transform/blob/master/tensorflow_transform/beam/tft_beam_io/transform_fn_io.py#L33-L45]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-7131) Spark portable runner appears to be repeating work (in TFX example)

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7131?focusedWorklogId=242050=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-242050
 ]

ASF GitHub Bot logged work on BEAM-7131:


Author: ASF GitHub Bot
Created on: 14/May/19 21:12
Start Date: 14/May/19 21:12
Worklog Time Spent: 10m 
  Work Description: angoenka commented on pull request #8558: [BEAM-7131] 
Spark: cache executable stage output to prevent re-computation
URL: https://github.com/apache/beam/pull/8558
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 242050)
Time Spent: 2h  (was: 1h 50m)

> Spark portable runner appears to be repeating work (in TFX example)
> ---
>
> Key: BEAM-7131
> URL: https://issues.apache.org/jira/browse/BEAM-7131
> Project: Beam
>  Issue Type: Bug
>  Components: runner-spark
>Reporter: Kyle Weaver
>Assignee: Kyle Weaver
>Priority: Major
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> I've been trying to run the TFX Chicago taxi example [1] on the Spark 
> portable runner. TFDV works fine, but the preprocess step 
> (preprocess_flink.sh [2]) fails with the following error:
> RuntimeError: AlreadyExistsError: file already exists [while running 
> 'WriteTransformFn/WriteTransformFn']
> Assets are being written multiple times to different temp directories, which 
> is okay, but the error occurs when they are copied to the same permanent 
> output directory. Specifically, the copy tree operation in transform_fn_io.py 
> [3] is run twice with the same output directory. The error doesn't occur when 
> that code is modified to allow overwriting existing files, but that's only a 
> shallow fix. While the TF transform should probably be made idempotent, this 
> is also an issue with the Spark runner, which shouldn't be repeating work 
> like this regularly (in the absence of a failure condition).
> [1] [https://github.com/tensorflow/tfx/tree/master/tfx/examples/chicago_taxi]
> [2] 
> [https://github.com/tensorflow/tfx/blob/master/tfx/examples/chicago_taxi/preprocess_flink.sh]
> [3] 
> [https://github.com/tensorflow/transform/blob/master/tensorflow_transform/beam/tft_beam_io/transform_fn_io.py#L33-L45]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-7131) Spark portable runner appears to be repeating work (in TFX example)

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7131?focusedWorklogId=242051=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-242051
 ]

ASF GitHub Bot logged work on BEAM-7131:


Author: ASF GitHub Bot
Created on: 14/May/19 21:12
Start Date: 14/May/19 21:12
Worklog Time Spent: 10m 
  Work Description: ibzib commented on pull request #8558: [BEAM-7131] 
Spark: cache executable stage output to prevent re-computation
URL: https://github.com/apache/beam/pull/8558
 
 
   When an executable stage has more than one consumer, it currently results in 
the executable stage being re-computed for each extra consumer. This behavior 
is undesirable because we are potentially fusing a large amount of work into 
some executable stages, in addition to thorny problems like side effects (see 
the [Jira](https://issues.apache.org/jira/browse/BEAM-7131)) which, though not 
generally encouraged, should not be breaking things so consistently and 
opaquely.
   
   R: @robertwb @angoenka 
   
   Post-Commit Tests Status (on master branch)
   

   
   Lang | SDK | Apex | Dataflow | Flink | Gearpump | Samza | Spark
   --- | --- | --- | --- | --- | --- | --- | ---
   Go | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Go/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Go/lastCompletedBuild/)
 | --- | --- | --- | --- | --- | ---
   Java | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java/lastCompletedBuild/)
 | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Apex/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Apex/lastCompletedBuild/)
 | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow/lastCompletedBuild/)
 | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink/lastCompletedBuild/)[![Build
 
Status](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Flink_Batch/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Flink_Batch/lastCompletedBuild/)[![Build
 
Status](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Flink_Streaming/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Flink_Streaming/lastCompletedBuild/)
 | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Gearpump/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Gearpump/lastCompletedBuild/)
 | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Samza/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Samza/lastCompletedBuild/)
 | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Spark/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Spark/lastCompletedBuild/)
   Python | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Python_Verify/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Python_Verify/lastCompletedBuild/)[![Build
 
Status](https://builds.apache.org/job/beam_PostCommit_Python3_Verify/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Python3_Verify/lastCompletedBuild/)
 | --- | [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Py_VR_Dataflow/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Py_VR_Dataflow/lastCompletedBuild/)
  [![Build 
Status](https://builds.apache.org/job/beam_PostCommit_Py_ValCont/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Py_ValCont/lastCompletedBuild/)
 | [![Build 
Status](https://builds.apache.org/job/beam_PreCommit_Python_PVR_Flink_Cron/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PreCommit_Python_PVR_Flink_Cron/lastCompletedBuild/)
 | --- | --- | ---
   
   Pre-Commit Tests Status (on master branch)
   

   
   --- |Java | Python | Go | Website
   --- | --- | --- | --- | ---
   Non-portable | [![Build 
Status](https://builds.apache.org/job/beam_PreCommit_Java_Cron/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PreCommit_Java_Cron/lastCompletedBuild/)
 | [![Build 

[jira] [Work logged] (BEAM-7131) Spark portable runner appears to be repeating work (in TFX example)

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7131?focusedWorklogId=242053=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-242053
 ]

ASF GitHub Bot logged work on BEAM-7131:


Author: ASF GitHub Bot
Created on: 14/May/19 21:12
Start Date: 14/May/19 21:12
Worklog Time Spent: 10m 
  Work Description: angoenka commented on issue #8558: [BEAM-7131] Spark: 
cache executable stage output to prevent re-computation
URL: https://github.com/apache/beam/pull/8558#issuecomment-492411854
 
 
   sorry for hijacking your pull request :)
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 242053)
Time Spent: 2.5h  (was: 2h 20m)

> Spark portable runner appears to be repeating work (in TFX example)
> ---
>
> Key: BEAM-7131
> URL: https://issues.apache.org/jira/browse/BEAM-7131
> Project: Beam
>  Issue Type: Bug
>  Components: runner-spark
>Reporter: Kyle Weaver
>Assignee: Kyle Weaver
>Priority: Major
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> I've been trying to run the TFX Chicago taxi example [1] on the Spark 
> portable runner. TFDV works fine, but the preprocess step 
> (preprocess_flink.sh [2]) fails with the following error:
> RuntimeError: AlreadyExistsError: file already exists [while running 
> 'WriteTransformFn/WriteTransformFn']
> Assets are being written multiple times to different temp directories, which 
> is okay, but the error occurs when they are copied to the same permanent 
> output directory. Specifically, the copy tree operation in transform_fn_io.py 
> [3] is run twice with the same output directory. The error doesn't occur when 
> that code is modified to allow overwriting existing files, but that's only a 
> shallow fix. While the TF transform should probably be made idempotent, this 
> is also an issue with the Spark runner, which shouldn't be repeating work 
> like this regularly (in the absence of a failure condition).
> [1] [https://github.com/tensorflow/tfx/tree/master/tfx/examples/chicago_taxi]
> [2] 
> [https://github.com/tensorflow/tfx/blob/master/tfx/examples/chicago_taxi/preprocess_flink.sh]
> [3] 
> [https://github.com/tensorflow/transform/blob/master/tensorflow_transform/beam/tft_beam_io/transform_fn_io.py#L33-L45]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (BEAM-7306) [SQL] Add support for distinct aggregations

2019-05-14 Thread Brian Hulette (JIRA)
Brian Hulette created BEAM-7306:
---

 Summary: [SQL] Add support for distinct aggregations
 Key: BEAM-7306
 URL: https://issues.apache.org/jira/browse/BEAM-7306
 Project: Beam
  Issue Type: New Feature
  Components: dsl-sql
Reporter: Brian Hulette


Currently we reject aggregations with the DISTINCT flag set: 
[https://github.com/apache/beam/pull/8498]

We should provide support for these aggregations in a scalable way. See the ML 
discussion on this topic here: 
[https://lists.apache.org/thread.html/24081b0d0b7f9709a5c0f574149fb6b9e9759cba06734200cf3810bf@%3Cdev.beam.apache.org%3E]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-2888) Runner Comparison / Capability Matrix revamp

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-2888?focusedWorklogId=242041=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-242041
 ]

ASF GitHub Bot logged work on BEAM-2888:


Author: ASF GitHub Bot
Created on: 14/May/19 21:03
Start Date: 14/May/19 21:03
Worklog Time Spent: 10m 
  Work Description: iemejia commented on pull request #8576: [BEAM-2888] 
Add the not-yet-fully-designed drain and checkpoint to runner comparison
URL: https://github.com/apache/beam/pull/8576#discussion_r283974250
 
 

 ##
 File path: website/src/_data/capability-matrix.yml
 ##
 @@ -1367,3 +1367,102 @@ categories:
 l1: 'No'
 l2: pending model support
 l3: ''
+  - description: Additional features
+anchor: misc
+color-b: 'aaa'
+color-y: 'bbb'
+color-p: 'ccc'
+color-n: 'ddd'
+rows:
+  - name: Drain
+values:
+  - class: model
+l1: 'Partially'
+l2: 
+l3: APIs and semantics for draining a pipeline are under 
discussion. This would cause incomplete aggregations to be emitted regardless 
of trigger and tagged with metadata indicating it is incomplated.
+  - class: dataflow
+l1: 'Partially'
+l2: 
+l3: Dataflow has a native drain operation, but it does not work in 
the presence of event time timer loops. Final implemention pending model 
support.
+  - class: flink
+l1: 
+l2: 
+l3: 
+  - class: spark
+l1: 
+l2: 
+l3: 
+  - class: apex
+l1: 
+l2: 
+l3: 
+  - class: gearpump
+l1: 
+l2: 
+l3: 
+  - class: mapreduce
+l1: 
+l2: 
+l3: 
+  - class: jstorm
+l1:
+l2: 
+l3: 
+  - class: ibmstreams
+l1: 
+l2: 
+l3: 
+  - class: samza
+l1: 
+l2: 
+l3: 
+  - class: nemo
+l1: 
+l2: 
+l3:
+  - name: Checkpoint
+values:
+  - class: model
+l1: 'Partially'
+l2: 
+l3: APIs and semantics for saving a pipeline checkpoint are under 
discussion. This would be a runner-specific materialization of the pipeline 
state required to resume or duplicate the pipeline. 
+  - class: dataflow
+l1: 'No'
+l2: 
+l3: 
+  - class: flink
+l1: 'Partially'
+l2: 
+l3: Flink has a native savepoint capability
+  - class: spark
+l1: 
+l2: 
+l3: 
 
 Review comment:
   Spark has a native checkpoint capability
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 242041)
Time Spent: 1.5h  (was: 1h 20m)

> Runner Comparison / Capability Matrix revamp
> 
>
> Key: BEAM-2888
> URL: https://issues.apache.org/jira/browse/BEAM-2888
> Project: Beam
>  Issue Type: Improvement
>  Components: website
>Reporter: Kenneth Knowles
>Assignee: Griselda Cuevas Zambrano
>Priority: Major
>  Labels: gsod, gsod2019
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Discussion: 
> https://lists.apache.org/thread.html/8aff7d70c254356f2dae3109fb605e0b60763602225a877d3dadf8b7@%3Cdev.beam.apache.org%3E
> Summarizing that discussion, we have a lot of issues/wishes. Some can be 
> addressed as one-off and some need a unified reorganization of the runner 
> comparison.
> Basic corrections:
>  - Remove rows that impossible to not support (ParDo)
>  - Remove rows where "support" doesn't really make sense (Composite 
> transforms)
>  - Deduplicate rows are actually the same model feature (all non-merging 
> windowing / all merging windowing)
>  - Clearly separate rows that represent optimizations (Combine)
>  - Correct rows in the wrong place (Timers are actually a "what...?" row)
>  - Separate or remove rows have not been designed ([Meta]Data driven 
> triggers, retractions)
>  - Rename rows with names that appear no where else (Timestamp control, which 
> is called a TimestampCombiner in Java)
>  - Switch to a more distinct color scheme for full/partial support (currently 
> just solid/faded colors)
>  - Switch to something clearer than "~" for partial support, 

[jira] [Work logged] (BEAM-7131) Spark portable runner appears to be repeating work (in TFX example)

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7131?focusedWorklogId=242040=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-242040
 ]

ASF GitHub Bot logged work on BEAM-7131:


Author: ASF GitHub Bot
Created on: 14/May/19 20:59
Start Date: 14/May/19 20:59
Worklog Time Spent: 10m 
  Work Description: angoenka commented on issue #8558: [BEAM-7131] Spark: 
cache executable stage output to prevent re-computation
URL: https://github.com/apache/beam/pull/8558#issuecomment-492407490
 
 
   Run Samza ValidatesRunner
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 242040)
Time Spent: 1h 40m  (was: 1.5h)

> Spark portable runner appears to be repeating work (in TFX example)
> ---
>
> Key: BEAM-7131
> URL: https://issues.apache.org/jira/browse/BEAM-7131
> Project: Beam
>  Issue Type: Bug
>  Components: runner-spark
>Reporter: Kyle Weaver
>Assignee: Kyle Weaver
>Priority: Major
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> I've been trying to run the TFX Chicago taxi example [1] on the Spark 
> portable runner. TFDV works fine, but the preprocess step 
> (preprocess_flink.sh [2]) fails with the following error:
> RuntimeError: AlreadyExistsError: file already exists [while running 
> 'WriteTransformFn/WriteTransformFn']
> Assets are being written multiple times to different temp directories, which 
> is okay, but the error occurs when they are copied to the same permanent 
> output directory. Specifically, the copy tree operation in transform_fn_io.py 
> [3] is run twice with the same output directory. The error doesn't occur when 
> that code is modified to allow overwriting existing files, but that's only a 
> shallow fix. While the TF transform should probably be made idempotent, this 
> is also an issue with the Spark runner, which shouldn't be repeating work 
> like this regularly (in the absence of a failure condition).
> [1] [https://github.com/tensorflow/tfx/tree/master/tfx/examples/chicago_taxi]
> [2] 
> [https://github.com/tensorflow/tfx/blob/master/tfx/examples/chicago_taxi/preprocess_flink.sh]
> [3] 
> [https://github.com/tensorflow/transform/blob/master/tensorflow_transform/beam/tft_beam_io/transform_fn_io.py#L33-L45]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-4046) Decouple gradle project names and maven artifact ids

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4046?focusedWorklogId=242038=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-242038
 ]

ASF GitHub Bot logged work on BEAM-4046:


Author: ASF GitHub Bot
Created on: 14/May/19 20:57
Start Date: 14/May/19 20:57
Worklog Time Spent: 10m 
  Work Description: adude3141 commented on issue #8194: [BEAM-4046] 
decouple gradle project names and maven artifact ids
URL: https://github.com/apache/beam/pull/8194#issuecomment-492406808
 
 
   Wouldn't revert right now, unless it is blocking. Seems to be easy fixes.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 242038)
Time Spent: 28h 50m  (was: 28h 40m)

> Decouple gradle project names and maven artifact ids
> 
>
> Key: BEAM-4046
> URL: https://issues.apache.org/jira/browse/BEAM-4046
> Project: Beam
>  Issue Type: Sub-task
>  Components: build-system
>Reporter: Kenneth Knowles
>Assignee: Michael Luckey
>Priority: Major
>  Time Spent: 28h 50m
>  Remaining Estimate: 0h
>
> In our first draft, we had gradle projects like {{":beam-sdks-java-core"}}. 
> It is clumsy and requires a hacky settings.gradle that is not idiomatic.
> In our second draft, we changed them to names that work well with Gradle, 
> like {{":sdks:java:core"}}. This caused Maven artifact IDs to be wonky.
> In our third draft, we regressed to the first draft to get the Maven artifact 
> ids right.
> These should be able to be decoupled. It seems there are many StackOverflow 
> questions on the subject.
> Since it is unidiomatic and a poor user experience, if it does turn out to be 
> mandatory then it needs to be documented inline everywhere - the 
> settings.gradle should say why it is so bizarre, and each build.gradle should 
> indicate what its project id is.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-6880) Deprecate Java Portable Reference Runner

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-6880?focusedWorklogId=242036=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-242036
 ]

ASF GitHub Bot logged work on BEAM-6880:


Author: ASF GitHub Bot
Created on: 14/May/19 20:56
Start Date: 14/May/19 20:56
Worklog Time Spent: 10m 
  Work Description: lukecwik commented on pull request #8380: [BEAM-6880] 
Remove deprecated Reference Runner code.
URL: https://github.com/apache/beam/pull/8380#discussion_r283996078
 
 

 ##
 File path: runners/direct-java/build.gradle
 ##
 @@ -70,7 +69,6 @@ dependencies {
   shadow library.java.args4j
   provided library.java.hamcrest_core
   provided library.java.junit
-  testRuntime project(path: ":sdks:java:harness")
 
 Review comment:
   +1, this was a long time pain point since this was part of the classpath and 
not an arg to TestPipeline that allowed the portable direct runner to launch 
the SDK harness as a separate process.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 242036)
Time Spent: 7h  (was: 6h 50m)

> Deprecate Java Portable Reference Runner
> 
>
> Key: BEAM-6880
> URL: https://issues.apache.org/jira/browse/BEAM-6880
> Project: Beam
>  Issue Type: New Feature
>  Components: runner-direct, test-failures, testing
>Reporter: Mikhail Gryzykhin
>Assignee: Daniel Oliveira
>Priority: Major
>  Time Spent: 7h
>  Remaining Estimate: 0h
>
> This ticket is about deprecating Java Portable Reference runner.
>  
> Discussion is happening in [this 
> thread|[https://lists.apache.org/thread.html/0b68efce9b7f2c5297b32d09e5d903e9b354199fe2ce446fbcd240bc@%3Cdev.beam.apache.org%3E]]
>  
>  
> Current summary is: disable beam_PostCommit_Java_PVR_Reference job.
> Keeping or removing reference runner code is still under discussion. It is 
> suggested to create PR that removes relevant code and start voting there.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-7302) Dependencies are broken in SNAPSHOT pom files

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7302?focusedWorklogId=242034=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-242034
 ]

ASF GitHub Bot logged work on BEAM-7302:


Author: ASF GitHub Bot
Created on: 14/May/19 20:55
Start Date: 14/May/19 20:55
Worklog Time Spent: 10m 
  Work Description: adude3141 commented on pull request #8577: [BEAM-7302] 
Fix project dependencies in pom.xml outputs
URL: https://github.com/apache/beam/pull/8577
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 242034)
Time Spent: 40m  (was: 0.5h)

> Dependencies are broken in SNAPSHOT pom files
> -
>
> Key: BEAM-7302
> URL: https://issues.apache.org/jira/browse/BEAM-7302
> Project: Beam
>  Issue Type: Bug
>  Components: build-system
>Affects Versions: 2.14.0
>Reporter: Ismaël Mejía
>Assignee: Michael Luckey
>Priority: Major
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> The generated pom in the SNAPSHOTS repository points to dependencies that 
> don't have the correct name, for example [the beam-sdks-java-core 
> pom|https://repository.apache.org/content/repositories/snapshots/org/apache/beam/beam-sdks-java-core/2.14.0-SNAPSHOT/beam-sdks-java-core-2.14.0-20190514.072148-6.pom]
>  points to
> {code}
> 
>   beam.model
>   pipeline
>   2.14.0-SNAPSHOT
>   compile
> 
> {code}
> but such groupId and artifactId do not exist (and have not existed in the 
> past).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-4046) Decouple gradle project names and maven artifact ids

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-4046?focusedWorklogId=242032=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-242032
 ]

ASF GitHub Bot logged work on BEAM-4046:


Author: ASF GitHub Bot
Created on: 14/May/19 20:50
Start Date: 14/May/19 20:50
Worklog Time Spent: 10m 
  Work Description: apilloud commented on issue #8194: [BEAM-4046] decouple 
gradle project names and maven artifact ids
URL: https://github.com/apache/beam/pull/8194#issuecomment-492404735
 
 
   I can't find "Run Seed Job" anywhere in the comments here, which means this 
wasn't actually tested. Reports suggest it broke things. Should we revert?
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 242032)
Time Spent: 28h 40m  (was: 28.5h)

> Decouple gradle project names and maven artifact ids
> 
>
> Key: BEAM-4046
> URL: https://issues.apache.org/jira/browse/BEAM-4046
> Project: Beam
>  Issue Type: Sub-task
>  Components: build-system
>Reporter: Kenneth Knowles
>Assignee: Michael Luckey
>Priority: Major
>  Time Spent: 28h 40m
>  Remaining Estimate: 0h
>
> In our first draft, we had gradle projects like {{":beam-sdks-java-core"}}. 
> It is clumsy and requires a hacky settings.gradle that is not idiomatic.
> In our second draft, we changed them to names that work well with Gradle, 
> like {{":sdks:java:core"}}. This caused Maven artifact IDs to be wonky.
> In our third draft, we regressed to the first draft to get the Maven artifact 
> ids right.
> These should be able to be decoupled. It seems there are many StackOverflow 
> questions on the subject.
> Since it is unidiomatic and a poor user experience, if it does turn out to be 
> mandatory then it needs to be documented inline everywhere - the 
> settings.gradle should say why it is so bizarre, and each build.gradle should 
> indicate what its project id is.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-7302) Dependencies are broken in SNAPSHOT pom files

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7302?focusedWorklogId=242033=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-242033
 ]

ASF GitHub Bot logged work on BEAM-7302:


Author: ASF GitHub Bot
Created on: 14/May/19 20:54
Start Date: 14/May/19 20:54
Worklog Time Spent: 10m 
  Work Description: adude3141 commented on issue #8577: [BEAM-7302] Fix 
project dependencies in pom.xml outputs
URL: https://github.com/apache/beam/pull/8577#issuecomment-492405834
 
 
   Seems to still have an issue with 
`generatePomFileForMavenJavaPublication`task, but publication looks fine.
   
   Merging. Thx @kennknowles 
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 242033)
Time Spent: 0.5h  (was: 20m)

> Dependencies are broken in SNAPSHOT pom files
> -
>
> Key: BEAM-7302
> URL: https://issues.apache.org/jira/browse/BEAM-7302
> Project: Beam
>  Issue Type: Bug
>  Components: build-system
>Affects Versions: 2.14.0
>Reporter: Ismaël Mejía
>Assignee: Michael Luckey
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The generated pom in the SNAPSHOTS repository points to dependencies that 
> don't have the correct name, for example [the beam-sdks-java-core 
> pom|https://repository.apache.org/content/repositories/snapshots/org/apache/beam/beam-sdks-java-core/2.14.0-SNAPSHOT/beam-sdks-java-core-2.14.0-20190514.072148-6.pom]
>  points to
> {code}
> 
>   beam.model
>   pipeline
>   2.14.0-SNAPSHOT
>   compile
> 
> {code}
> but such groupId and artifactId do not exist (and have not existed in the 
> past).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-6880) Deprecate Java Portable Reference Runner

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-6880?focusedWorklogId=242031=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-242031
 ]

ASF GitHub Bot logged work on BEAM-6880:


Author: ASF GitHub Bot
Created on: 14/May/19 20:43
Start Date: 14/May/19 20:43
Worklog Time Spent: 10m 
  Work Description: youngoli commented on issue #8380: [BEAM-6880] Remove 
deprecated Reference Runner code.
URL: https://github.com/apache/beam/pull/8380#issuecomment-492402349
 
 
   The force push was me rebasing the changes onto master, to account for some 
changes to the build.gradle files that were merged.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 242031)
Time Spent: 6h 50m  (was: 6h 40m)

> Deprecate Java Portable Reference Runner
> 
>
> Key: BEAM-6880
> URL: https://issues.apache.org/jira/browse/BEAM-6880
> Project: Beam
>  Issue Type: New Feature
>  Components: runner-direct, test-failures, testing
>Reporter: Mikhail Gryzykhin
>Assignee: Daniel Oliveira
>Priority: Major
>  Time Spent: 6h 50m
>  Remaining Estimate: 0h
>
> This ticket is about deprecating Java Portable Reference runner.
>  
> Discussion is happening in [this 
> thread|[https://lists.apache.org/thread.html/0b68efce9b7f2c5297b32d09e5d903e9b354199fe2ce446fbcd240bc@%3Cdev.beam.apache.org%3E]]
>  
>  
> Current summary is: disable beam_PostCommit_Java_PVR_Reference job.
> Keeping or removing reference runner code is still under discussion. It is 
> suggested to create PR that removes relevant code and start voting there.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (BEAM-7161) filter.Distinct should accept a KV type

2019-05-14 Thread Damien Desfontaines (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Damien Desfontaines reassigned BEAM-7161:
-

Assignee: Damien Desfontaines

> filter.Distinct should accept a KV type
> ---
>
> Key: BEAM-7161
> URL: https://issues.apache.org/jira/browse/BEAM-7161
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-go
>Reporter: Damien Desfontaines
>Assignee: Damien Desfontaines
>Priority: Major
>
> filter.Distinct only works on PCollection. It would be nice if it worked 
> for PCollection> too (returning distinct KV pairs).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-7302) Dependencies are broken in SNAPSHOT pom files

2019-05-14 Thread Michael Luckey (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-7302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16839771#comment-16839771
 ] 

Michael Luckey commented on BEAM-7302:
--

Thx, [~kenn] for jumping in. Agree on the cause. I am actually wondering, how 
this slipped through. Going to merge this to fix the current issue and will 
further look whether that rewriting is still required. 

> Dependencies are broken in SNAPSHOT pom files
> -
>
> Key: BEAM-7302
> URL: https://issues.apache.org/jira/browse/BEAM-7302
> Project: Beam
>  Issue Type: Bug
>  Components: build-system
>Affects Versions: 2.14.0
>Reporter: Ismaël Mejía
>Assignee: Michael Luckey
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The generated pom in the SNAPSHOTS repository points to dependencies that 
> don't have the correct name, for example [the beam-sdks-java-core 
> pom|https://repository.apache.org/content/repositories/snapshots/org/apache/beam/beam-sdks-java-core/2.14.0-SNAPSHOT/beam-sdks-java-core-2.14.0-20190514.072148-6.pom]
>  points to
> {code}
> 
>   beam.model
>   pipeline
>   2.14.0-SNAPSHOT
>   compile
> 
> {code}
> but such groupId and artifactId do not exist (and have not existed in the 
> past).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-7131) Spark portable runner appears to be repeating work (in TFX example)

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7131?focusedWorklogId=242029=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-242029
 ]

ASF GitHub Bot logged work on BEAM-7131:


Author: ASF GitHub Bot
Created on: 14/May/19 20:36
Start Date: 14/May/19 20:36
Worklog Time Spent: 10m 
  Work Description: angoenka commented on issue #8558: [BEAM-7131] Spark: 
cache executable stage output to prevent re-computation
URL: https://github.com/apache/beam/pull/8558#issuecomment-492399983
 
 
   Ignore Run Samza ValidatesRunner 
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 242029)
Time Spent: 1.5h  (was: 1h 20m)

> Spark portable runner appears to be repeating work (in TFX example)
> ---
>
> Key: BEAM-7131
> URL: https://issues.apache.org/jira/browse/BEAM-7131
> Project: Beam
>  Issue Type: Bug
>  Components: runner-spark
>Reporter: Kyle Weaver
>Assignee: Kyle Weaver
>Priority: Major
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> I've been trying to run the TFX Chicago taxi example [1] on the Spark 
> portable runner. TFDV works fine, but the preprocess step 
> (preprocess_flink.sh [2]) fails with the following error:
> RuntimeError: AlreadyExistsError: file already exists [while running 
> 'WriteTransformFn/WriteTransformFn']
> Assets are being written multiple times to different temp directories, which 
> is okay, but the error occurs when they are copied to the same permanent 
> output directory. Specifically, the copy tree operation in transform_fn_io.py 
> [3] is run twice with the same output directory. The error doesn't occur when 
> that code is modified to allow overwriting existing files, but that's only a 
> shallow fix. While the TF transform should probably be made idempotent, this 
> is also an issue with the Spark runner, which shouldn't be repeating work 
> like this regularly (in the absence of a failure condition).
> [1] [https://github.com/tensorflow/tfx/tree/master/tfx/examples/chicago_taxi]
> [2] 
> [https://github.com/tensorflow/tfx/blob/master/tfx/examples/chicago_taxi/preprocess_flink.sh]
> [3] 
> [https://github.com/tensorflow/transform/blob/master/tensorflow_transform/beam/tft_beam_io/transform_fn_io.py#L33-L45]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-7131) Spark portable runner appears to be repeating work (in TFX example)

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7131?focusedWorklogId=242028=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-242028
 ]

ASF GitHub Bot logged work on BEAM-7131:


Author: ASF GitHub Bot
Created on: 14/May/19 20:36
Start Date: 14/May/19 20:36
Worklog Time Spent: 10m 
  Work Description: angoenka commented on issue #8558: [BEAM-7131] Spark: 
cache executable stage output to prevent re-computation
URL: https://github.com/apache/beam/pull/8558#issuecomment-492399928
 
 
   Run Samza ValidatesRunner
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 242028)
Time Spent: 1h 20m  (was: 1h 10m)

> Spark portable runner appears to be repeating work (in TFX example)
> ---
>
> Key: BEAM-7131
> URL: https://issues.apache.org/jira/browse/BEAM-7131
> Project: Beam
>  Issue Type: Bug
>  Components: runner-spark
>Reporter: Kyle Weaver
>Assignee: Kyle Weaver
>Priority: Major
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> I've been trying to run the TFX Chicago taxi example [1] on the Spark 
> portable runner. TFDV works fine, but the preprocess step 
> (preprocess_flink.sh [2]) fails with the following error:
> RuntimeError: AlreadyExistsError: file already exists [while running 
> 'WriteTransformFn/WriteTransformFn']
> Assets are being written multiple times to different temp directories, which 
> is okay, but the error occurs when they are copied to the same permanent 
> output directory. Specifically, the copy tree operation in transform_fn_io.py 
> [3] is run twice with the same output directory. The error doesn't occur when 
> that code is modified to allow overwriting existing files, but that's only a 
> shallow fix. While the TF transform should probably be made idempotent, this 
> is also an issue with the Spark runner, which shouldn't be repeating work 
> like this regularly (in the absence of a failure condition).
> [1] [https://github.com/tensorflow/tfx/tree/master/tfx/examples/chicago_taxi]
> [2] 
> [https://github.com/tensorflow/tfx/blob/master/tfx/examples/chicago_taxi/preprocess_flink.sh]
> [3] 
> [https://github.com/tensorflow/transform/blob/master/tensorflow_transform/beam/tft_beam_io/transform_fn_io.py#L33-L45]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-6493) examples in Kotlin

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-6493?focusedWorklogId=242027=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-242027
 ]

ASF GitHub Bot logged work on BEAM-6493:


Author: ASF GitHub Bot
Created on: 14/May/19 20:34
Start Date: 14/May/19 20:34
Worklog Time Spent: 10m 
  Work Description: pabloem commented on pull request #8439: [BEAM-6493] : 
Add cookbook and snippets examples in Kotlin
URL: https://github.com/apache/beam/pull/8439
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 242027)
Time Spent: 12h 50m  (was: 12h 40m)
Remaining Estimate: 492h 10m  (was: 492h 20m)

> examples in Kotlin
> --
>
> Key: BEAM-6493
> URL: https://issues.apache.org/jira/browse/BEAM-6493
> Project: Beam
>  Issue Type: Task
>  Components: examples-java
>Affects Versions: Not applicable
>Reporter: Harshit Dwivedi
>Assignee: Harshit Dwivedi
>Priority: Minor
>  Labels: documentation
> Fix For: Not applicable
>
>   Original Estimate: 504h
>  Time Spent: 12h 50m
>  Remaining Estimate: 492h 10m
>
> I have been using Apache Beam for few of my projects in production since the 
> past 6 months and apart from Java, [Kotlin|https://kotlinlang.org/] also 
> seems to work as well with no issues whatsoever.
> But currently, the Github Repository of Apache Beam contains examples only in 
> Java which might be an issue for other developers who want to use Apache Beam 
> SDK with kotlin as there are no sample resources available.
> That said, I would love to go ahead and add kotlin examples alongside the 
> current java examples in the [Beam 
> repository|https://github.com/apache/beam/tree/master/examples/java].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-7302) Dependencies are broken in SNAPSHOT pom files

2019-05-14 Thread Luke Cwik (JIRA)


[ 
https://issues.apache.org/jira/browse/BEAM-7302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16839762#comment-16839762
 ] 

Luke Cwik commented on BEAM-7302:
-

The dependencies were originally rewritten because the groupId/artifactId were 
incorrect when publishing the pom.xml. Eventually the reason as to why they 
were rewritten was so we could add the exclusions on each dep to prevent things 
like guava-jdk5 from being pulled in transitively. If the publish plugin can 
now handle this, we could remove a bunch of the rewriting logic.

> Dependencies are broken in SNAPSHOT pom files
> -
>
> Key: BEAM-7302
> URL: https://issues.apache.org/jira/browse/BEAM-7302
> Project: Beam
>  Issue Type: Bug
>  Components: build-system
>Affects Versions: 2.14.0
>Reporter: Ismaël Mejía
>Assignee: Michael Luckey
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The generated pom in the SNAPSHOTS repository points to dependencies that 
> don't have the correct name, for example [the beam-sdks-java-core 
> pom|https://repository.apache.org/content/repositories/snapshots/org/apache/beam/beam-sdks-java-core/2.14.0-SNAPSHOT/beam-sdks-java-core-2.14.0-20190514.072148-6.pom]
>  points to
> {code}
> 
>   beam.model
>   pipeline
>   2.14.0-SNAPSHOT
>   compile
> 
> {code}
> but such groupId and artifactId do not exist (and have not existed in the 
> past).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (BEAM-7223) Enable Python3 tests for Flink

2019-05-14 Thread Valentyn Tymofieiev (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Valentyn Tymofieiev updated BEAM-7223:
--
Status: Open  (was: Triage Needed)

> Enable Python3 tests for Flink
> --
>
> Key: BEAM-7223
> URL: https://issues.apache.org/jira/browse/BEAM-7223
> Project: Beam
>  Issue Type: Sub-task
>  Components: runner-flink
>Reporter: Ankur Goenka
>Priority: Major
>
> Add py3 integration tests for Flink



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-7277) Add PostCommit suite for Python 3.7

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7277?focusedWorklogId=242021=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-242021
 ]

ASF GitHub Bot logged work on BEAM-7277:


Author: ASF GitHub Bot
Created on: 14/May/19 20:24
Start Date: 14/May/19 20:24
Worklog Time Spent: 10m 
  Work Description: tvalentyn commented on issue #8564: [BEAM-7277] Add 
PostCommit test suite for Python 3.7
URL: https://github.com/apache/beam/pull/8564#issuecomment-492395752
 
 
   There is still: ":sdks:python:test-suites:direct:py37:postCommitIT' not 
found in root project 'beam'."
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 242021)
Time Spent: 3h 20m  (was: 3h 10m)

> Add PostCommit suite for Python 3.7
> ---
>
> Key: BEAM-7277
> URL: https://issues.apache.org/jira/browse/BEAM-7277
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-py-core
> Environment: Python 3.7
>Reporter: Frederik Bode
>Assignee: Frederik Bode
>Priority: Major
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> Add PostCommit suites for Python 3.7.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-7277) Add PostCommit suite for Python 3.7

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7277?focusedWorklogId=242023=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-242023
 ]

ASF GitHub Bot logged work on BEAM-7277:


Author: ASF GitHub Bot
Created on: 14/May/19 20:25
Start Date: 14/May/19 20:25
Worklog Time Spent: 10m 
  Work Description: tvalentyn commented on issue #8564: [BEAM-7277] Add 
PostCommit test suite for Python 3.7
URL: https://github.com/apache/beam/pull/8564#issuecomment-492396112
 
 
   Also please add the datastore test to the Direct runner suites on 3.6, 3.7. 
Thanks!
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 242023)
Time Spent: 3h 40m  (was: 3.5h)

> Add PostCommit suite for Python 3.7
> ---
>
> Key: BEAM-7277
> URL: https://issues.apache.org/jira/browse/BEAM-7277
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-py-core
> Environment: Python 3.7
>Reporter: Frederik Bode
>Assignee: Frederik Bode
>Priority: Major
>  Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> Add PostCommit suites for Python 3.7.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-7277) Add PostCommit suite for Python 3.7

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7277?focusedWorklogId=242022=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-242022
 ]

ASF GitHub Bot logged work on BEAM-7277:


Author: ASF GitHub Bot
Created on: 14/May/19 20:24
Start Date: 14/May/19 20:24
Worklog Time Spent: 10m 
  Work Description: tvalentyn commented on issue #8564: [BEAM-7277] Add 
PostCommit test suite for Python 3.7
URL: https://github.com/apache/beam/pull/8564#issuecomment-492395752
 
 
   There is still: `:sdks:python:test-suites:direct:py37:postCommitIT' not 
found in root project 'beam'.`
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 242022)
Time Spent: 3.5h  (was: 3h 20m)

> Add PostCommit suite for Python 3.7
> ---
>
> Key: BEAM-7277
> URL: https://issues.apache.org/jira/browse/BEAM-7277
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-py-core
> Environment: Python 3.7
>Reporter: Frederik Bode
>Assignee: Frederik Bode
>Priority: Major
>  Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> Add PostCommit suites for Python 3.7.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (BEAM-7284) Support Py3 Dataclasses

2019-05-14 Thread Valentyn Tymofieiev (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Valentyn Tymofieiev updated BEAM-7284:
--
Status: Open  (was: Triage Needed)

> Support Py3 Dataclasses 
> 
>
> Key: BEAM-7284
> URL: https://issues.apache.org/jira/browse/BEAM-7284
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-py-core
>Reporter: Valentyn Tymofieiev
>Priority: Major
>
> It looks like dill does not support Dataclasses yet, 
> https://github.com/uqfoundation/dill/issues/312, which very likely means that 
> Beam does not support them either.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (BEAM-6881) Dataflow worker harness image tag does not match released sdk version for Python 3.6+

2019-05-14 Thread Valentyn Tymofieiev (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-6881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Valentyn Tymofieiev closed BEAM-6881.
-
Resolution: Fixed

> Dataflow worker harness image tag does not match released sdk version for 
> Python 3.6+
> -
>
> Key: BEAM-6881
> URL: https://issues.apache.org/jira/browse/BEAM-6881
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-py-core
>Reporter: Robbe
>Priority: Major
> Fix For: Not applicable
>
>
> The following failure occurs:
> {noformat}
> 14:09:16 FAIL: test_worker_harness_image_tag_matches_released_sdk_version 
> (apache_beam.runners.dataflow.internal.apiclient_test.UtilTest)
>  14:09:16 
> --
>  14:09:16 Traceback (most recent call last):
>  14:09:16 File 
> "/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/sdks/python/target/.tox/py36-gcp/lib/python3.6/site-packages/mock/mock.py",
>  line 1305, in patched
>  14:09:16 return func(*args, **keywargs)
>  14:09:16 File 
> "/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/sdks/python/apache_beam/runners/dataflow/internal/apiclient_test.py",
>  line 352, in test_worker_harness_image_tag_matches_released_sdk_version
>  14:09:16 '/python3-fnapi:2.2.0'))
>  14:09:16 AssertionError: 
> 'gcr.io/cloud-dataflow/v1beta3/python36-fnapi:2.2.0' != 
> 'gcr.io/cloud-dataflow/v1beta3/python3-fnapi:2.2.0'
>  14:09:16 - gcr.io/cloud-dataflow/v1beta3/python36-fnapi:2.2.0
>  14:09:16 ? -
>  14:09:16 + gcr.io/cloud-dataflow/v1beta3/python3-fnapi:2.2.0{noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (BEAM-7203) Dataflow runner should set use_fastavro experiment on Python 3.

2019-05-14 Thread Valentyn Tymofieiev (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Valentyn Tymofieiev updated BEAM-7203:
--
Status: Open  (was: Triage Needed)

> Dataflow runner should set use_fastavro experiment on Python 3.
> ---
>
> Key: BEAM-7203
> URL: https://issues.apache.org/jira/browse/BEAM-7203
> Project: Beam
>  Issue Type: Sub-task
>  Components: io-python-avro
>Reporter: Valentyn Tymofieiev
>Assignee: Valentyn Tymofieiev
>Priority: Major
> Fix For: 2.14.0
>
>
> cc: [~frederik] [~Juta] [~chamikara]
> Related:
> - https://github.com/apache/beam/pull/8130
> - https://lists.apache.org/list.html?u...@beam.apache.org:lte=99M:use_fastavro



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (BEAM-7198) Rename ToStringCoder into ToBytesCoder

2019-05-14 Thread Valentyn Tymofieiev (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Valentyn Tymofieiev updated BEAM-7198:
--
Status: Open  (was: Triage Needed)

> Rename ToStringCoder into ToBytesCoder
> --
>
> Key: BEAM-7198
> URL: https://issues.apache.org/jira/browse/BEAM-7198
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-py-core
>Reporter: Valentyn Tymofieiev
>Priority: Minor
>
> The name of ToStringCoder class [1] is confusing, since the output of 
> encode() on Python3 will be bytes. On Python 2 the output is also bytes, 
> since bytes and string a synonyms on Py2.
> ToBytesCoder would be a better name for this class. 
> Note that this class is not listed in coders that constitute Public APIs [2], 
> so we can treat this as internal change. As a courtesy to users  who happened 
> to reference a non-public coder in their pipelines we can keep the old class 
> name as an alias, e.g. ToStringCoder = ToBytesCoder to avoid friction, but 
> clean up Beam codeabase to use the new name.
> [1] 
> https://github.com/apache/beam/blob/ef4b2ef7e5fa2fb87e1491df82d2797947f51be9/sdks/python/apache_beam/coders/coders.py#L344
> [2] 
> https://github.com/apache/beam/blob/ef4b2ef7e5fa2fb87e1491df82d2797947f51be9/sdks/python/apache_beam/coders/coders.py#L20
> cc: [~yoshiki.obata] [~chamikara]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (BEAM-7277) Add PostCommit suite for Python 3.7

2019-05-14 Thread Valentyn Tymofieiev (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Valentyn Tymofieiev updated BEAM-7277:
--
Status: Open  (was: Triage Needed)

> Add PostCommit suite for Python 3.7
> ---
>
> Key: BEAM-7277
> URL: https://issues.apache.org/jira/browse/BEAM-7277
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-py-core
> Environment: Python 3.7
>Reporter: Frederik Bode
>Assignee: Frederik Bode
>Priority: Major
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> Add PostCommit suites for Python 3.7.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-6493) examples in Kotlin

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-6493?focusedWorklogId=242019=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-242019
 ]

ASF GitHub Bot logged work on BEAM-6493:


Author: ASF GitHub Bot
Created on: 14/May/19 20:21
Start Date: 14/May/19 20:21
Worklog Time Spent: 10m 
  Work Description: harshithdwivedi commented on issue #8439: [BEAM-6493] : 
Add cookbook and snippets examples in Kotlin
URL: https://github.com/apache/beam/pull/8439#issuecomment-492394713
 
 
   Yeah, I agree.
   Let me look into making this work in a new PR.
   For now you can go ahead and merge this one 
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 242019)
Time Spent: 12h 40m  (was: 12.5h)
Remaining Estimate: 492h 20m  (was: 492.5h)

> examples in Kotlin
> --
>
> Key: BEAM-6493
> URL: https://issues.apache.org/jira/browse/BEAM-6493
> Project: Beam
>  Issue Type: Task
>  Components: examples-java
>Affects Versions: Not applicable
>Reporter: Harshit Dwivedi
>Assignee: Harshit Dwivedi
>Priority: Minor
>  Labels: documentation
> Fix For: Not applicable
>
>   Original Estimate: 504h
>  Time Spent: 12h 40m
>  Remaining Estimate: 492h 20m
>
> I have been using Apache Beam for few of my projects in production since the 
> past 6 months and apart from Java, [Kotlin|https://kotlinlang.org/] also 
> seems to work as well with no issues whatsoever.
> But currently, the Github Repository of Apache Beam contains examples only in 
> Java which might be an issue for other developers who want to use Apache Beam 
> SDK with kotlin as there are no sample resources available.
> That said, I would love to go ahead and add kotlin examples alongside the 
> current java examples in the [Beam 
> repository|https://github.com/apache/beam/tree/master/examples/java].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-2888) Runner Comparison / Capability Matrix revamp

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-2888?focusedWorklogId=241999=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-241999
 ]

ASF GitHub Bot logged work on BEAM-2888:


Author: ASF GitHub Bot
Created on: 14/May/19 19:58
Start Date: 14/May/19 19:58
Worklog Time Spent: 10m 
  Work Description: iemejia commented on pull request #8576: [BEAM-2888] 
Add the not-yet-fully-designed drain and checkpoint to runner comparison
URL: https://github.com/apache/beam/pull/8576#discussion_r283974250
 
 

 ##
 File path: website/src/_data/capability-matrix.yml
 ##
 @@ -1367,3 +1367,102 @@ categories:
 l1: 'No'
 l2: pending model support
 l3: ''
+  - description: Additional features
+anchor: misc
+color-b: 'aaa'
+color-y: 'bbb'
+color-p: 'ccc'
+color-n: 'ddd'
+rows:
+  - name: Drain
+values:
+  - class: model
+l1: 'Partially'
+l2: 
+l3: APIs and semantics for draining a pipeline are under 
discussion. This would cause incomplete aggregations to be emitted regardless 
of trigger and tagged with metadata indicating it is incomplated.
+  - class: dataflow
+l1: 'Partially'
+l2: 
+l3: Dataflow has a native drain operation, but it does not work in 
the presence of event time timer loops. Final implemention pending model 
support.
+  - class: flink
+l1: 
+l2: 
+l3: 
+  - class: spark
+l1: 
+l2: 
+l3: 
+  - class: apex
+l1: 
+l2: 
+l3: 
+  - class: gearpump
+l1: 
+l2: 
+l3: 
+  - class: mapreduce
+l1: 
+l2: 
+l3: 
+  - class: jstorm
+l1:
+l2: 
+l3: 
+  - class: ibmstreams
+l1: 
+l2: 
+l3: 
+  - class: samza
+l1: 
+l2: 
+l3: 
+  - class: nemo
+l1: 
+l2: 
+l3:
+  - name: Checkpoint
+values:
+  - class: model
+l1: 'Partially'
+l2: 
+l3: APIs and semantics for saving a pipeline checkpoint are under 
discussion. This would be a runner-specific materialization of the pipeline 
state required to resume or duplicate the pipeline. 
+  - class: dataflow
+l1: 'No'
+l2: 
+l3: 
+  - class: flink
+l1: 'Partially'
+l2: 
+l3: Flink has a native savepoint capability
+  - class: spark
+l1: 
+l2: 
+l3: 
 
 Review comment:
   Spark has a native savepoint capability
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 241999)
Time Spent: 1h 10m  (was: 1h)

> Runner Comparison / Capability Matrix revamp
> 
>
> Key: BEAM-2888
> URL: https://issues.apache.org/jira/browse/BEAM-2888
> Project: Beam
>  Issue Type: Improvement
>  Components: website
>Reporter: Kenneth Knowles
>Assignee: Griselda Cuevas Zambrano
>Priority: Major
>  Labels: gsod, gsod2019
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Discussion: 
> https://lists.apache.org/thread.html/8aff7d70c254356f2dae3109fb605e0b60763602225a877d3dadf8b7@%3Cdev.beam.apache.org%3E
> Summarizing that discussion, we have a lot of issues/wishes. Some can be 
> addressed as one-off and some need a unified reorganization of the runner 
> comparison.
> Basic corrections:
>  - Remove rows that impossible to not support (ParDo)
>  - Remove rows where "support" doesn't really make sense (Composite 
> transforms)
>  - Deduplicate rows are actually the same model feature (all non-merging 
> windowing / all merging windowing)
>  - Clearly separate rows that represent optimizations (Combine)
>  - Correct rows in the wrong place (Timers are actually a "what...?" row)
>  - Separate or remove rows have not been designed ([Meta]Data driven 
> triggers, retractions)
>  - Rename rows with names that appear no where else (Timestamp control, which 
> is called a TimestampCombiner in Java)
>  - Switch to a more distinct color scheme for full/partial support (currently 
> just solid/faded colors)
>  - Switch to something clearer than "~" for partial support, versus 

[jira] [Work logged] (BEAM-2888) Runner Comparison / Capability Matrix revamp

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-2888?focusedWorklogId=242000=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-242000
 ]

ASF GitHub Bot logged work on BEAM-2888:


Author: ASF GitHub Bot
Created on: 14/May/19 19:58
Start Date: 14/May/19 19:58
Worklog Time Spent: 10m 
  Work Description: iemejia commented on pull request #8576: [BEAM-2888] 
Add the not-yet-fully-designed drain and checkpoint to runner comparison
URL: https://github.com/apache/beam/pull/8576#discussion_r283974106
 
 

 ##
 File path: website/src/_data/capability-matrix.yml
 ##
 @@ -1367,3 +1367,102 @@ categories:
 l1: 'No'
 l2: pending model support
 l3: ''
+  - description: Additional features
+anchor: misc
+color-b: 'aaa'
+color-y: 'bbb'
+color-p: 'ccc'
+color-n: 'ddd'
+rows:
+  - name: Drain
+values:
+  - class: model
+l1: 'Partially'
+l2: 
+l3: APIs and semantics for draining a pipeline are under 
discussion. This would cause incomplete aggregations to be emitted regardless 
of trigger and tagged with metadata indicating it is incomplated.
+  - class: dataflow
+l1: 'Partially'
+l2: 
+l3: Dataflow has a native drain operation, but it does not work in 
the presence of event time timer loops. Final implemention pending model 
support.
+  - class: flink
+l1: 
+l2: 
+l3: 
+  - class: spark
+l1: 
+l2: 
+l3: 
+  - class: apex
+l1: 
+l2: 
+l3: 
+  - class: gearpump
+l1: 
+l2: 
+l3: 
+  - class: mapreduce
+l1: 
+l2: 
+l3: 
+  - class: jstorm
+l1:
+l2: 
+l3: 
+  - class: ibmstreams
+l1: 
+l2: 
+l3: 
+  - class: samza
+l1: 
+l2: 
+l3: 
+  - class: nemo
+l1: 
+l2: 
+l3:
+  - name: Checkpoint
+values:
+  - class: model
+l1: 'Partially'
+l2: 
+l3: APIs and semantics for saving a pipeline checkpoint are under 
discussion. This would be a runner-specific materialization of the pipeline 
state required to resume or duplicate the pipeline. 
+  - class: dataflow
+l1: 'No'
+l2: 
+l3: 
+  - class: flink
+l1: 'Partially'
+l2: 
+l3: Flink has a native savepoint capability
+  - class: spark
+l1: 
 
 Review comment:
   'Partially'
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 242000)
Time Spent: 1h 20m  (was: 1h 10m)

> Runner Comparison / Capability Matrix revamp
> 
>
> Key: BEAM-2888
> URL: https://issues.apache.org/jira/browse/BEAM-2888
> Project: Beam
>  Issue Type: Improvement
>  Components: website
>Reporter: Kenneth Knowles
>Assignee: Griselda Cuevas Zambrano
>Priority: Major
>  Labels: gsod, gsod2019
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Discussion: 
> https://lists.apache.org/thread.html/8aff7d70c254356f2dae3109fb605e0b60763602225a877d3dadf8b7@%3Cdev.beam.apache.org%3E
> Summarizing that discussion, we have a lot of issues/wishes. Some can be 
> addressed as one-off and some need a unified reorganization of the runner 
> comparison.
> Basic corrections:
>  - Remove rows that impossible to not support (ParDo)
>  - Remove rows where "support" doesn't really make sense (Composite 
> transforms)
>  - Deduplicate rows are actually the same model feature (all non-merging 
> windowing / all merging windowing)
>  - Clearly separate rows that represent optimizations (Combine)
>  - Correct rows in the wrong place (Timers are actually a "what...?" row)
>  - Separate or remove rows have not been designed ([Meta]Data driven 
> triggers, retractions)
>  - Rename rows with names that appear no where else (Timestamp control, which 
> is called a TimestampCombiner in Java)
>  - Switch to a more distinct color scheme for full/partial support (currently 
> just solid/faded colors)
>  - Switch to something clearer than "~" for partial support, versus ✘ and ✓ 
> for none and full.
>  - Correct Gearpump support 

[jira] [Work logged] (BEAM-6693) ApproximateUnique transform for Python SDK

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-6693?focusedWorklogId=241997=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-241997
 ]

ASF GitHub Bot logged work on BEAM-6693:


Author: ASF GitHub Bot
Created on: 14/May/19 19:55
Start Date: 14/May/19 19:55
Worklog Time Spent: 10m 
  Work Description: pabloem commented on pull request #8535: [BEAM-6693] 
ApproximateUnique transform for Python SDK
URL: https://github.com/apache/beam/pull/8535#discussion_r283973063
 
 

 ##
 File path: sdks/python/apache_beam/transforms/stats.py
 ##
 @@ -0,0 +1,205 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""This module has all statistic related transforms."""
+
+from __future__ import absolute_import
+from __future__ import division
+
+import heapq
+import math
+import sys
+from builtins import round
+
+import mmh3
+
+from apache_beam.transforms.core import *
+from apache_beam.transforms.ptransform import PTransform
+
+__all__ = [
+'ApproximateUniqueGlobally',
+'ApproximateUniquePerKey',
+]
+
+
+class ApproximateUniqueGlobally(PTransform):
+  """
+  Hashes input elements and uses those to extrapolate the size of the entire
+  set of hash values by assuming the rest of the hash values are as densely
+  distributed as the sample space.
+
+  Args:
+**kwargs: Accepts a single named argument "size" or "error".
+size: an int not smaller than 16, which we would use to estimate
+  number of unique values.
+error: max estimation error, which is a float between 0.01
+  and 0.50. If error is given, size will be calculated from error with
+  _get_sample_size_from_est_error function.
+  """
+
+  _NO_VALUE_ERR_MSG = 'Either size or error should be set. Received {}.'
+  _MULTI_VALUE_ERR_MSG = 'Either size or error should be set. ' \
+ 'Received {size = %s, error = %s}.'
+  _INPUT_SIZE_ERR_MSG = 'ApproximateUnique needs a size >= 16 for an error ' \
+'<= 0.50. In general, the estimation error is about ' \
+'2 / sqrt(sample_size). Received {size = %s}.'
+  _INPUT_ERROR_ERR_MSG = 'ApproximateUnique needs an estimation error ' \
+ 'between 0.01 and 0.50. Received {error = %s}.'
+
+  def __init__(self, size=None, error=None):
+
+if None not in (size, error):
+  raise ValueError(self._MULTI_VALUE_ERR_MSG % (size, error))
+elif size is None and error is None:
+  raise ValueError(self._NO_VALUE_ERR_MSG)
+elif size is not None:
+  if not isinstance(size, int) or size < 16:
+raise ValueError(self._INPUT_SIZE_ERR_MSG % (size))
+  else:
+self._sample_size = size
+self._max_est_err = None
+else:
+  if error < 0.01 or error > 0.5:
+raise ValueError(self._INPUT_ERROR_ERR_MSG % (error))
+  else:
+self._sample_size = self._get_sample_size_from_est_error(error)
+self._max_est_err = error
+
+  def expand(self, pcoll):
+return pcoll \
+   | 'CountGlobalUniqueValues' \
+   >> 
(CombineGlobally(ApproximateUniqueCombineDoFn(self._sample_size)))
+
+  @staticmethod
+  def _get_sample_size_from_est_error(est_err):
+"""
+:return: sample size
+
+Calculate sample size from estimation error
+"""
+# math.ceil in python 2.7 returns float, while it returns int in python 3.
+return int(math.ceil(4.0 / math.pow(est_err, 2.0)))
+
+
+class ApproximateUniquePerKey(ApproximateUniqueGlobally):
+
+  def expand(self, pcoll):
+return pcoll \
+   | 'CountPerKeyUniqueValues' \
+   >> (CombinePerKey(ApproximateUniqueCombineDoFn(self._sample_size)))
+
+
+class _LargestUnique(object):
+  """
+  An object to keep samples and calculate sample hash space. It is an
+  accumulator of a combine function.
+  """
+  _HASH_SPACE_SIZE = 2.0 * sys.maxsize
+
+  def __init__(self, sample_size):
+self._sample_size = sample_size
+self._min_hash = sys.maxsize
+self._sample_heap = []
+self._sample_set = set()
+
+  def add(self, element):
+"""
+:param an element from pcoll.
+:return: boolean type whether the value is 

[jira] [Work logged] (BEAM-6693) ApproximateUnique transform for Python SDK

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-6693?focusedWorklogId=241994=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-241994
 ]

ASF GitHub Bot logged work on BEAM-6693:


Author: ASF GitHub Bot
Created on: 14/May/19 19:54
Start Date: 14/May/19 19:54
Worklog Time Spent: 10m 
  Work Description: pabloem commented on pull request #8535: [BEAM-6693] 
ApproximateUnique transform for Python SDK
URL: https://github.com/apache/beam/pull/8535#discussion_r283971856
 
 

 ##
 File path: sdks/python/apache_beam/transforms/stats.py
 ##
 @@ -0,0 +1,206 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Core PTransform subclasses, such as FlatMap, GroupByKey, and Map."""
+
+from __future__ import absolute_import
+from __future__ import division
+
+import heapq
+import logging
+import math
+import sys
+from builtins import round
+
+from apache_beam.transforms.core import *
+from apache_beam.transforms.ptransform import PTransform
+
+try:
+  import mmh3
+except ImportError:
+  logging.info('Python version >=3.0 uses buildin hash function.')
+
+__all__ = [
+'ApproximateUniqueGlobally',
+'ApproximateUniquePerKey',
+]
+
+
+class ApproximateUniqueGlobally(PTransform):
+  """
+  Hashes input elements and uses those to extrapolate the size of the entire
+  set of hash values by assuming the rest of the hash values are as densely
+  distributed as the sample space.
+
+  Args:
+**kwargs: Accepts a single named argument "size" or "error".
+size: an int not smaller than 16, which we would use to estimate
+  number of unique values.
+error: max estimation error, which is a float between 0.01
+  and 0.50. If error is given, size will be calculated from error with
+  _get_sample_size_from_est_error function.
+  """
+
+  _NO_VALUE_ERR_MSG = 'Either size or error should be set. Received {}.'
+  _MULTI_VALUE_ERR_MSG = 'Either size or error should be set. ' \
+ 'Received {size = %s, error = %s}.'
+  _INPUT_SIZE_ERR_MSG = 'ApproximateUnique needs a size >= 16 for an error ' \
+'<= 0.50. In general, the estimation error is about ' \
+'2 / sqrt(sample_size). Received {size = %s}.'
+  _INPUT_ERROR_ERR_MSG = 'ApproximateUnique needs an estimation error ' \
+ 'between 0.01 and 0.50. Received {error = %s}.'
+
+  def __init__(self, **kwargs):
+input_size = kwargs.pop('size', None)
+input_err = kwargs.pop('error', None)
+
+if None not in (input_size, input_err):
+  raise ValueError(self._MULTI_VALUE_ERR_MSG % (input_size, input_err))
+elif input_size is None and input_err is None:
+  raise ValueError(self._NO_VALUE_ERR_MSG)
+elif input_size is not None:
+  if not isinstance(input_size, int) or input_size < 16:
+raise ValueError(self._INPUT_SIZE_ERR_MSG % (input_size))
+  else:
+self._sample_size = input_size
+self._max_est_err = None
+else:
+  if input_err < 0.01 or input_err > 0.5:
+raise ValueError(self._INPUT_ERROR_ERR_MSG % (input_err))
+  else:
+self._sample_size = self._get_sample_size_from_est_error(input_err)
+self._max_est_err = input_err
+
+  def expand(self, pcoll):
+return pcoll \
+   | 'CountGlobalUniqueValues' \
+   >> 
(CombineGlobally(ApproximateUniqueCombineDoFn(self._sample_size)))
+
+  @staticmethod
+  def _get_sample_size_from_est_error(est_err):
+"""
+:return: sample size
+
+Calculate sample size from estimation error
+"""
+return int(math.ceil(4.0 / math.pow(est_err, 2.0)))
+
+
+class ApproximateUniquePerKey(ApproximateUniqueGlobally):
+
+  def expand(self, pcoll):
+return pcoll \
+   | 'CountPerKeyUniqueValues' \
+   >> (CombinePerKey(ApproximateUniqueCombineDoFn(self._sample_size)))
+
+
+class _LargestUnique(object):
+  """
+  An object to keep samples and calculate sample hash space. It is an
+  accumulator of a combine function.
+  """
+  _HASH_SPACE_SIZE = 2.0 * sys.maxsize
+
+  def __init__(self, sample_size):
+self._sample_size = sample_size
+

[jira] [Work logged] (BEAM-6693) ApproximateUnique transform for Python SDK

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-6693?focusedWorklogId=241992=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-241992
 ]

ASF GitHub Bot logged work on BEAM-6693:


Author: ASF GitHub Bot
Created on: 14/May/19 19:54
Start Date: 14/May/19 19:54
Worklog Time Spent: 10m 
  Work Description: pabloem commented on pull request #8535: [BEAM-6693] 
ApproximateUnique transform for Python SDK
URL: https://github.com/apache/beam/pull/8535#discussion_r283970237
 
 

 ##
 File path: sdks/python/apache_beam/transforms/stats.py
 ##
 @@ -0,0 +1,205 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""This module has all statistic related transforms."""
+
+from __future__ import absolute_import
+from __future__ import division
+
+import heapq
+import math
+import sys
+from builtins import round
+
+import mmh3
+
+from apache_beam.transforms.core import *
+from apache_beam.transforms.ptransform import PTransform
+
+__all__ = [
+'ApproximateUniqueGlobally',
+'ApproximateUniquePerKey',
+]
+
+
+class ApproximateUniqueGlobally(PTransform):
+  """
+  Hashes input elements and uses those to extrapolate the size of the entire
+  set of hash values by assuming the rest of the hash values are as densely
+  distributed as the sample space.
+
+  Args:
+**kwargs: Accepts a single named argument "size" or "error".
+size: an int not smaller than 16, which we would use to estimate
+  number of unique values.
+error: max estimation error, which is a float between 0.01
+  and 0.50. If error is given, size will be calculated from error with
+  _get_sample_size_from_est_error function.
+  """
+
+  _NO_VALUE_ERR_MSG = 'Either size or error should be set. Received {}.'
+  _MULTI_VALUE_ERR_MSG = 'Either size or error should be set. ' \
+ 'Received {size = %s, error = %s}.'
+  _INPUT_SIZE_ERR_MSG = 'ApproximateUnique needs a size >= 16 for an error ' \
+'<= 0.50. In general, the estimation error is about ' \
+'2 / sqrt(sample_size). Received {size = %s}.'
+  _INPUT_ERROR_ERR_MSG = 'ApproximateUnique needs an estimation error ' \
+ 'between 0.01 and 0.50. Received {error = %s}.'
+
+  def __init__(self, size=None, error=None):
+
+if None not in (size, error):
+  raise ValueError(self._MULTI_VALUE_ERR_MSG % (size, error))
+elif size is None and error is None:
+  raise ValueError(self._NO_VALUE_ERR_MSG)
+elif size is not None:
+  if not isinstance(size, int) or size < 16:
+raise ValueError(self._INPUT_SIZE_ERR_MSG % (size))
+  else:
+self._sample_size = size
+self._max_est_err = None
+else:
+  if error < 0.01 or error > 0.5:
+raise ValueError(self._INPUT_ERROR_ERR_MSG % (error))
+  else:
+self._sample_size = self._get_sample_size_from_est_error(error)
+self._max_est_err = error
+
+  def expand(self, pcoll):
+return pcoll \
+   | 'CountGlobalUniqueValues' \
+   >> 
(CombineGlobally(ApproximateUniqueCombineDoFn(self._sample_size)))
+
+  @staticmethod
+  def _get_sample_size_from_est_error(est_err):
+"""
+:return: sample size
+
+Calculate sample size from estimation error
+"""
+# math.ceil in python 2.7 returns float, while it returns int in python 3.
+return int(math.ceil(4.0 / math.pow(est_err, 2.0)))
+
+
+class ApproximateUniquePerKey(ApproximateUniqueGlobally):
+
+  def expand(self, pcoll):
+return pcoll \
+   | 'CountPerKeyUniqueValues' \
+   >> (CombinePerKey(ApproximateUniqueCombineDoFn(self._sample_size)))
+
+
+class _LargestUnique(object):
+  """
+  An object to keep samples and calculate sample hash space. It is an
+  accumulator of a combine function.
+  """
+  _HASH_SPACE_SIZE = 2.0 * sys.maxsize
+
+  def __init__(self, sample_size):
+self._sample_size = sample_size
+self._min_hash = sys.maxsize
+self._sample_heap = []
+self._sample_set = set()
+
+  def add(self, element):
+"""
+:param an element from pcoll.
+:return: boolean type whether the value is 

[jira] [Work logged] (BEAM-6693) ApproximateUnique transform for Python SDK

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-6693?focusedWorklogId=241991=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-241991
 ]

ASF GitHub Bot logged work on BEAM-6693:


Author: ASF GitHub Bot
Created on: 14/May/19 19:54
Start Date: 14/May/19 19:54
Worklog Time Spent: 10m 
  Work Description: pabloem commented on pull request #8535: [BEAM-6693] 
ApproximateUnique transform for Python SDK
URL: https://github.com/apache/beam/pull/8535#discussion_r283529625
 
 

 ##
 File path: sdks/python/apache_beam/transforms/stats.py
 ##
 @@ -0,0 +1,206 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Core PTransform subclasses, such as FlatMap, GroupByKey, and Map."""
+
+from __future__ import absolute_import
+from __future__ import division
+
+import heapq
+import logging
+import math
+import sys
+from builtins import round
+
+from apache_beam.transforms.core import *
+from apache_beam.transforms.ptransform import PTransform
+
+try:
+  import mmh3
+except ImportError:
+  logging.info('Python version >=3.0 uses buildin hash function.')
+
+__all__ = [
+'ApproximateUniqueGlobally',
+'ApproximateUniquePerKey',
+]
+
+
+class ApproximateUniqueGlobally(PTransform):
+  """
+  Hashes input elements and uses those to extrapolate the size of the entire
+  set of hash values by assuming the rest of the hash values are as densely
+  distributed as the sample space.
+
+  Args:
+**kwargs: Accepts a single named argument "size" or "error".
+size: an int not smaller than 16, which we would use to estimate
+  number of unique values.
+error: max estimation error, which is a float between 0.01
+  and 0.50. If error is given, size will be calculated from error with
+  _get_sample_size_from_est_error function.
+  """
+
+  _NO_VALUE_ERR_MSG = 'Either size or error should be set. Received {}.'
+  _MULTI_VALUE_ERR_MSG = 'Either size or error should be set. ' \
+ 'Received {size = %s, error = %s}.'
+  _INPUT_SIZE_ERR_MSG = 'ApproximateUnique needs a size >= 16 for an error ' \
+'<= 0.50. In general, the estimation error is about ' \
+'2 / sqrt(sample_size). Received {size = %s}.'
+  _INPUT_ERROR_ERR_MSG = 'ApproximateUnique needs an estimation error ' \
+ 'between 0.01 and 0.50. Received {error = %s}.'
+
+  def __init__(self, **kwargs):
+input_size = kwargs.pop('size', None)
+input_err = kwargs.pop('error', None)
+
+if None not in (input_size, input_err):
+  raise ValueError(self._MULTI_VALUE_ERR_MSG % (input_size, input_err))
+elif input_size is None and input_err is None:
+  raise ValueError(self._NO_VALUE_ERR_MSG)
+elif input_size is not None:
+  if not isinstance(input_size, int) or input_size < 16:
+raise ValueError(self._INPUT_SIZE_ERR_MSG % (input_size))
+  else:
+self._sample_size = input_size
+self._max_est_err = None
+else:
+  if input_err < 0.01 or input_err > 0.5:
+raise ValueError(self._INPUT_ERROR_ERR_MSG % (input_err))
+  else:
+self._sample_size = self._get_sample_size_from_est_error(input_err)
+self._max_est_err = input_err
+
+  def expand(self, pcoll):
+return pcoll \
+   | 'CountGlobalUniqueValues' \
+   >> 
(CombineGlobally(ApproximateUniqueCombineDoFn(self._sample_size)))
+
+  @staticmethod
+  def _get_sample_size_from_est_error(est_err):
+"""
+:return: sample size
+
+Calculate sample size from estimation error
+"""
+return int(math.ceil(4.0 / math.pow(est_err, 2.0)))
+
+
+class ApproximateUniquePerKey(ApproximateUniqueGlobally):
+
+  def expand(self, pcoll):
+return pcoll \
+   | 'CountPerKeyUniqueValues' \
+   >> (CombinePerKey(ApproximateUniqueCombineDoFn(self._sample_size)))
+
+
+class _LargestUnique(object):
+  """
+  An object to keep samples and calculate sample hash space. It is an
+  accumulator of a combine function.
+  """
+  _HASH_SPACE_SIZE = 2.0 * sys.maxsize
+
+  def __init__(self, sample_size):
+self._sample_size = sample_size
+

[jira] [Work logged] (BEAM-6693) ApproximateUnique transform for Python SDK

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-6693?focusedWorklogId=241993=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-241993
 ]

ASF GitHub Bot logged work on BEAM-6693:


Author: ASF GitHub Bot
Created on: 14/May/19 19:54
Start Date: 14/May/19 19:54
Worklog Time Spent: 10m 
  Work Description: pabloem commented on pull request #8535: [BEAM-6693] 
ApproximateUnique transform for Python SDK
URL: https://github.com/apache/beam/pull/8535#discussion_r283536855
 
 

 ##
 File path: sdks/python/apache_beam/transforms/stats_test.py
 ##
 @@ -0,0 +1,384 @@
+# -*- coding: utf-8 -*-
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from __future__ import absolute_import
+from __future__ import division
+
+import math
+import random
+import unittest
+from collections import defaultdict
+
+import numpy as np
+
+import apache_beam as beam
+from apache_beam.testing.test_pipeline import TestPipeline
+from apache_beam.testing.util import assert_that
+from apache_beam.testing.util import equal_to
+
+
+class ApproximateUniqueTest(unittest.TestCase):
+  """Unit tests for ApproximateUniqueGlobally and ApproximateUniquePerKey."""
+
+  def test_approximate_unique_global_by_invalid_size(self):
+# test if the transformation throws an error as expected with an invalid
+# small input size (< 16).
+sample_size = 10
+test_input = [random.randint(0, 1000) for _ in range(100)]
+
+with self.assertRaises(ValueError) as e:
+  pipeline = TestPipeline()
+  _ = (pipeline
+   | 'create'
+   >> beam.Create(test_input)
+   | 'get_estimate'
+   >> beam.ApproximateUniqueGlobally(size=sample_size))
+  pipeline.run()
+
+expected_msg = beam.ApproximateUniqueGlobally._INPUT_SIZE_ERR_MSG % (
+sample_size)
+
+assert e.exception.args[0] == expected_msg
+
+  def test_approximate_unique_global_by_invalid_type_size(self):
+# test if the transformation throws an error as expected with an invalid
+# type of input size (not int).
+sample_size = 100.0
+test_input = [random.randint(0, 1000) for _ in range(100)]
+
+with self.assertRaises(ValueError) as e:
+  pipeline = TestPipeline()
+  _ = (pipeline
+   | 'create' >> beam.Create(test_input)
+   | 'get_estimate'
+   >> beam.ApproximateUniqueGlobally(size=sample_size))
+  pipeline.run()
+
+expected_msg = beam.ApproximateUniqueGlobally._INPUT_SIZE_ERR_MSG % (
+sample_size)
+
+assert e.exception.args[0] == expected_msg
+
+  def test_approximate_unique_global_by_invalid_small_error(self):
+# test if the transformation throws an error as expected with an invalid
+# small input error (< 0.01).
+est_err = 0.0
+test_input = [random.randint(0, 1000) for _ in range(100)]
+
+with self.assertRaises(ValueError) as e:
+  pipeline = TestPipeline()
+  _ = (pipeline
+   | 'create' >> beam.Create(test_input)
+   | 'get_estimate'
+   >> beam.ApproximateUniqueGlobally(error=est_err))
+  pipeline.run()
+
+expected_msg = beam.ApproximateUniqueGlobally._INPUT_ERROR_ERR_MSG % (
+est_err)
+
+assert e.exception.args[0] == expected_msg
+
+  def test_approximate_unique_global_by_invalid_big_error(self):
+# test if the transformation throws an error as expected with an invalid
+# big input error (> 0.50).
+est_err = 0.6
+test_input = [random.randint(0, 1000) for _ in range(100)]
+
+with self.assertRaises(ValueError) as e:
+  pipeline = TestPipeline()
+  _ = (pipeline
+   | 'create' >> beam.Create(test_input)
+   | 'get_estimate'
+   >> beam.ApproximateUniqueGlobally(error=est_err))
+  pipeline.run()
+
+expected_msg = beam.ApproximateUniqueGlobally._INPUT_ERROR_ERR_MSG % (
+est_err)
+
+assert e.exception.args[0] == expected_msg
+
+  def test_approximate_unique_global_by_invalid_no_input(self):
+# test if the transformation throws an error as expected with no input.
+test_input = [random.randint(0, 1000) for _ in range(100)]
+
+with self.assertRaises(ValueError) 

[jira] [Work logged] (BEAM-6693) ApproximateUnique transform for Python SDK

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-6693?focusedWorklogId=241996=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-241996
 ]

ASF GitHub Bot logged work on BEAM-6693:


Author: ASF GitHub Bot
Created on: 14/May/19 19:54
Start Date: 14/May/19 19:54
Worklog Time Spent: 10m 
  Work Description: pabloem commented on pull request #8535: [BEAM-6693] 
ApproximateUnique transform for Python SDK
URL: https://github.com/apache/beam/pull/8535#discussion_r283547926
 
 

 ##
 File path: sdks/python/apache_beam/transforms/stats_test.py
 ##
 @@ -0,0 +1,384 @@
+# -*- coding: utf-8 -*-
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from __future__ import absolute_import
+from __future__ import division
+
+import math
+import random
+import unittest
+from collections import defaultdict
+
+import numpy as np
+
+import apache_beam as beam
+from apache_beam.testing.test_pipeline import TestPipeline
+from apache_beam.testing.util import assert_that
+from apache_beam.testing.util import equal_to
+
+
+class ApproximateUniqueTest(unittest.TestCase):
+  """Unit tests for ApproximateUniqueGlobally and ApproximateUniquePerKey."""
+
+  def test_approximate_unique_global_by_invalid_size(self):
+# test if the transformation throws an error as expected with an invalid
+# small input size (< 16).
+sample_size = 10
+test_input = [random.randint(0, 1000) for _ in range(100)]
+
+with self.assertRaises(ValueError) as e:
+  pipeline = TestPipeline()
+  _ = (pipeline
+   | 'create'
+   >> beam.Create(test_input)
+   | 'get_estimate'
+   >> beam.ApproximateUniqueGlobally(size=sample_size))
+  pipeline.run()
+
+expected_msg = beam.ApproximateUniqueGlobally._INPUT_SIZE_ERR_MSG % (
+sample_size)
+
+assert e.exception.args[0] == expected_msg
+
+  def test_approximate_unique_global_by_invalid_type_size(self):
+# test if the transformation throws an error as expected with an invalid
+# type of input size (not int).
+sample_size = 100.0
+test_input = [random.randint(0, 1000) for _ in range(100)]
+
+with self.assertRaises(ValueError) as e:
+  pipeline = TestPipeline()
+  _ = (pipeline
+   | 'create' >> beam.Create(test_input)
+   | 'get_estimate'
+   >> beam.ApproximateUniqueGlobally(size=sample_size))
+  pipeline.run()
+
+expected_msg = beam.ApproximateUniqueGlobally._INPUT_SIZE_ERR_MSG % (
+sample_size)
+
+assert e.exception.args[0] == expected_msg
+
+  def test_approximate_unique_global_by_invalid_small_error(self):
+# test if the transformation throws an error as expected with an invalid
+# small input error (< 0.01).
+est_err = 0.0
+test_input = [random.randint(0, 1000) for _ in range(100)]
+
+with self.assertRaises(ValueError) as e:
+  pipeline = TestPipeline()
+  _ = (pipeline
+   | 'create' >> beam.Create(test_input)
+   | 'get_estimate'
+   >> beam.ApproximateUniqueGlobally(error=est_err))
+  pipeline.run()
+
+expected_msg = beam.ApproximateUniqueGlobally._INPUT_ERROR_ERR_MSG % (
+est_err)
+
+assert e.exception.args[0] == expected_msg
+
+  def test_approximate_unique_global_by_invalid_big_error(self):
+# test if the transformation throws an error as expected with an invalid
+# big input error (> 0.50).
+est_err = 0.6
+test_input = [random.randint(0, 1000) for _ in range(100)]
+
+with self.assertRaises(ValueError) as e:
+  pipeline = TestPipeline()
+  _ = (pipeline
+   | 'create' >> beam.Create(test_input)
+   | 'get_estimate'
+   >> beam.ApproximateUniqueGlobally(error=est_err))
+  pipeline.run()
+
+expected_msg = beam.ApproximateUniqueGlobally._INPUT_ERROR_ERR_MSG % (
+est_err)
+
+assert e.exception.args[0] == expected_msg
+
+  def test_approximate_unique_global_by_invalid_no_input(self):
+# test if the transformation throws an error as expected with no input.
+test_input = [random.randint(0, 1000) for _ in range(100)]
+
+with self.assertRaises(ValueError) 

[jira] [Work logged] (BEAM-6693) ApproximateUnique transform for Python SDK

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-6693?focusedWorklogId=241990=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-241990
 ]

ASF GitHub Bot logged work on BEAM-6693:


Author: ASF GitHub Bot
Created on: 14/May/19 19:54
Start Date: 14/May/19 19:54
Worklog Time Spent: 10m 
  Work Description: pabloem commented on pull request #8535: [BEAM-6693] 
ApproximateUnique transform for Python SDK
URL: https://github.com/apache/beam/pull/8535#discussion_r283967191
 
 

 ##
 File path: sdks/python/apache_beam/transforms/stats.py
 ##
 @@ -0,0 +1,205 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""This module has all statistic related transforms."""
+
+from __future__ import absolute_import
+from __future__ import division
+
+import heapq
+import math
+import sys
+from builtins import round
+
+import mmh3
+
+from apache_beam.transforms.core import *
+from apache_beam.transforms.ptransform import PTransform
+
+__all__ = [
+'ApproximateUniqueGlobally',
+'ApproximateUniquePerKey',
+]
+
+
+class ApproximateUniqueGlobally(PTransform):
+  """
+  Hashes input elements and uses those to extrapolate the size of the entire
+  set of hash values by assuming the rest of the hash values are as densely
+  distributed as the sample space.
+
+  Args:
+**kwargs: Accepts a single named argument "size" or "error".
+size: an int not smaller than 16, which we would use to estimate
+  number of unique values.
+error: max estimation error, which is a float between 0.01
+  and 0.50. If error is given, size will be calculated from error with
+  _get_sample_size_from_est_error function.
+  """
+
+  _NO_VALUE_ERR_MSG = 'Either size or error should be set. Received {}.'
+  _MULTI_VALUE_ERR_MSG = 'Either size or error should be set. ' \
+ 'Received {size = %s, error = %s}.'
+  _INPUT_SIZE_ERR_MSG = 'ApproximateUnique needs a size >= 16 for an error ' \
+'<= 0.50. In general, the estimation error is about ' \
+'2 / sqrt(sample_size). Received {size = %s}.'
+  _INPUT_ERROR_ERR_MSG = 'ApproximateUnique needs an estimation error ' \
+ 'between 0.01 and 0.50. Received {error = %s}.'
+
+  def __init__(self, size=None, error=None):
+
+if None not in (size, error):
+  raise ValueError(self._MULTI_VALUE_ERR_MSG % (size, error))
+elif size is None and error is None:
+  raise ValueError(self._NO_VALUE_ERR_MSG)
+elif size is not None:
+  if not isinstance(size, int) or size < 16:
+raise ValueError(self._INPUT_SIZE_ERR_MSG % (size))
+  else:
+self._sample_size = size
+self._max_est_err = None
+else:
+  if error < 0.01 or error > 0.5:
+raise ValueError(self._INPUT_ERROR_ERR_MSG % (error))
+  else:
+self._sample_size = self._get_sample_size_from_est_error(error)
+self._max_est_err = error
+
+  def expand(self, pcoll):
+return pcoll \
+   | 'CountGlobalUniqueValues' \
+   >> 
(CombineGlobally(ApproximateUniqueCombineDoFn(self._sample_size)))
+
+  @staticmethod
+  def _get_sample_size_from_est_error(est_err):
+"""
+:return: sample size
+
+Calculate sample size from estimation error
+"""
+# math.ceil in python 2.7 returns float, while it returns int in python 3.
+return int(math.ceil(4.0 / math.pow(est_err, 2.0)))
+
+
+class ApproximateUniquePerKey(ApproximateUniqueGlobally):
+
+  def expand(self, pcoll):
+return pcoll \
+   | 'CountPerKeyUniqueValues' \
+   >> (CombinePerKey(ApproximateUniqueCombineDoFn(self._sample_size)))
+
+
+class _LargestUnique(object):
+  """
+  An object to keep samples and calculate sample hash space. It is an
+  accumulator of a combine function.
+  """
+  _HASH_SPACE_SIZE = 2.0 * sys.maxsize
+
+  def __init__(self, sample_size):
+self._sample_size = sample_size
+self._min_hash = sys.maxsize
+self._sample_heap = []
+self._sample_set = set()
+
+  def add(self, element):
+"""
+:param an element from pcoll.
+:return: boolean type whether the value is 

[jira] [Work logged] (BEAM-6693) ApproximateUnique transform for Python SDK

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-6693?focusedWorklogId=241995=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-241995
 ]

ASF GitHub Bot logged work on BEAM-6693:


Author: ASF GitHub Bot
Created on: 14/May/19 19:54
Start Date: 14/May/19 19:54
Worklog Time Spent: 10m 
  Work Description: pabloem commented on pull request #8535: [BEAM-6693] 
ApproximateUnique transform for Python SDK
URL: https://github.com/apache/beam/pull/8535#discussion_r283524460
 
 

 ##
 File path: sdks/python/apache_beam/transforms/stats.py
 ##
 @@ -0,0 +1,206 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""Core PTransform subclasses, such as FlatMap, GroupByKey, and Map."""
+
+from __future__ import absolute_import
+from __future__ import division
+
+import heapq
+import logging
+import math
+import sys
+from builtins import round
+
+from apache_beam.transforms.core import *
+from apache_beam.transforms.ptransform import PTransform
+
+try:
+  import mmh3
+except ImportError:
+  logging.info('Python version >=3.0 uses buildin hash function.')
+
+__all__ = [
+'ApproximateUniqueGlobally',
+'ApproximateUniquePerKey',
+]
+
+
+class ApproximateUniqueGlobally(PTransform):
+  """
+  Hashes input elements and uses those to extrapolate the size of the entire
+  set of hash values by assuming the rest of the hash values are as densely
+  distributed as the sample space.
+
+  Args:
+**kwargs: Accepts a single named argument "size" or "error".
+size: an int not smaller than 16, which we would use to estimate
+  number of unique values.
+error: max estimation error, which is a float between 0.01
+  and 0.50. If error is given, size will be calculated from error with
+  _get_sample_size_from_est_error function.
+  """
+
+  _NO_VALUE_ERR_MSG = 'Either size or error should be set. Received {}.'
+  _MULTI_VALUE_ERR_MSG = 'Either size or error should be set. ' \
+ 'Received {size = %s, error = %s}.'
+  _INPUT_SIZE_ERR_MSG = 'ApproximateUnique needs a size >= 16 for an error ' \
+'<= 0.50. In general, the estimation error is about ' \
+'2 / sqrt(sample_size). Received {size = %s}.'
+  _INPUT_ERROR_ERR_MSG = 'ApproximateUnique needs an estimation error ' \
+ 'between 0.01 and 0.50. Received {error = %s}.'
+
+  def __init__(self, **kwargs):
+input_size = kwargs.pop('size', None)
+input_err = kwargs.pop('error', None)
+
+if None not in (input_size, input_err):
+  raise ValueError(self._MULTI_VALUE_ERR_MSG % (input_size, input_err))
+elif input_size is None and input_err is None:
+  raise ValueError(self._NO_VALUE_ERR_MSG)
+elif input_size is not None:
+  if not isinstance(input_size, int) or input_size < 16:
+raise ValueError(self._INPUT_SIZE_ERR_MSG % (input_size))
+  else:
+self._sample_size = input_size
+self._max_est_err = None
+else:
+  if input_err < 0.01 or input_err > 0.5:
+raise ValueError(self._INPUT_ERROR_ERR_MSG % (input_err))
+  else:
+self._sample_size = self._get_sample_size_from_est_error(input_err)
+self._max_est_err = input_err
+
+  def expand(self, pcoll):
+return pcoll \
+   | 'CountGlobalUniqueValues' \
+   >> 
(CombineGlobally(ApproximateUniqueCombineDoFn(self._sample_size)))
+
+  @staticmethod
+  def _get_sample_size_from_est_error(est_err):
+"""
+:return: sample size
+
+Calculate sample size from estimation error
+"""
+return int(math.ceil(4.0 / math.pow(est_err, 2.0)))
+
+
+class ApproximateUniquePerKey(ApproximateUniqueGlobally):
+
+  def expand(self, pcoll):
+return pcoll \
+   | 'CountPerKeyUniqueValues' \
+   >> (CombinePerKey(ApproximateUniqueCombineDoFn(self._sample_size)))
+
+
+class _LargestUnique(object):
+  """
+  An object to keep samples and calculate sample hash space. It is an
+  accumulator of a combine function.
+  """
+  _HASH_SPACE_SIZE = 2.0 * sys.maxsize
+
+  def __init__(self, sample_size):
+self._sample_size = sample_size
+

[jira] [Commented] (BEAM-7302) Dependencies are broken in SNAPSHOT pom files

2019-05-14 Thread JIRA


[ 
https://issues.apache.org/jira/browse/BEAM-7302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16839738#comment-16839738
 ] 

Ismaël Mejía commented on BEAM-7302:


Maybe [~lcwik] knows the reason.

> Dependencies are broken in SNAPSHOT pom files
> -
>
> Key: BEAM-7302
> URL: https://issues.apache.org/jira/browse/BEAM-7302
> Project: Beam
>  Issue Type: Bug
>  Components: build-system
>Affects Versions: 2.14.0
>Reporter: Ismaël Mejía
>Assignee: Michael Luckey
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The generated pom in the SNAPSHOTS repository points to dependencies that 
> don't have the correct name, for example [the beam-sdks-java-core 
> pom|https://repository.apache.org/content/repositories/snapshots/org/apache/beam/beam-sdks-java-core/2.14.0-SNAPSHOT/beam-sdks-java-core-2.14.0-20190514.072148-6.pom]
>  points to
> {code}
> 
>   beam.model
>   pipeline
>   2.14.0-SNAPSHOT
>   compile
> 
> {code}
> but such groupId and artifactId do not exist (and have not existed in the 
> past).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-7302) Dependencies are broken in SNAPSHOT pom files

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7302?focusedWorklogId=241988=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-241988
 ]

ASF GitHub Bot logged work on BEAM-7302:


Author: ASF GitHub Bot
Created on: 14/May/19 19:42
Start Date: 14/May/19 19:42
Worklog Time Spent: 10m 
  Work Description: lukecwik commented on pull request #8577: [BEAM-7302] 
Fix project dependencies in pom.xml outputs
URL: https://github.com/apache/beam/pull/8577#discussion_r283967112
 
 

 ##
 File path: 
buildSrc/src/main/groovy/org/apache/beam/gradle/BeamModulePlugin.groovy
 ##
 @@ -36,6 +36,9 @@ import org.gradle.api.tasks.compile.JavaCompile
 import org.gradle.api.tasks.javadoc.Javadoc
 import org.gradle.api.tasks.testing.Test
 import org.gradle.testing.jacoco.tasks.JacocoReport
+
+import java.util.concurrent.atomic.AtomicInteger
 
 Review comment:
   nit: fix import ordering
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 241988)
Time Spent: 20m  (was: 10m)

> Dependencies are broken in SNAPSHOT pom files
> -
>
> Key: BEAM-7302
> URL: https://issues.apache.org/jira/browse/BEAM-7302
> Project: Beam
>  Issue Type: Bug
>  Components: build-system
>Affects Versions: 2.14.0
>Reporter: Ismaël Mejía
>Assignee: Michael Luckey
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The generated pom in the SNAPSHOTS repository points to dependencies that 
> don't have the correct name, for example [the beam-sdks-java-core 
> pom|https://repository.apache.org/content/repositories/snapshots/org/apache/beam/beam-sdks-java-core/2.14.0-SNAPSHOT/beam-sdks-java-core-2.14.0-20190514.072148-6.pom]
>  points to
> {code}
> 
>   beam.model
>   pipeline
>   2.14.0-SNAPSHOT
>   compile
> 
> {code}
> but such groupId and artifactId do not exist (and have not existed in the 
> past).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (BEAM-7303) Move Portable Runner and other of reference runner.

2019-05-14 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/BEAM-7303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ismaël Mejía updated BEAM-7303:
---
Status: Open  (was: Triage Needed)

> Move Portable Runner and other of reference runner.
> ---
>
> Key: BEAM-7303
> URL: https://issues.apache.org/jira/browse/BEAM-7303
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Reporter: Ankur Goenka
>Assignee: Ankur Goenka
>Priority: Major
> Fix For: 2.14.0
>
>
> PortableRunner is used by all Flink, Spark ... . 
> We should move it out of Reference Runner package to stream line the 
> dependencies.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-6493) examples in Kotlin

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-6493?focusedWorklogId=241972=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-241972
 ]

ASF GitHub Bot logged work on BEAM-6493:


Author: ASF GitHub Bot
Created on: 14/May/19 19:22
Start Date: 14/May/19 19:22
Worklog Time Spent: 10m 
  Work Description: pabloem commented on issue #8439: [BEAM-6493] : Add 
cookbook and snippets examples in Kotlin
URL: https://github.com/apache/beam/pull/8439#issuecomment-492374644
 
 
   This is an issue for a later change, but probably the kotlin examples 
precommit should run as part of the Java precommit. Thoughts?
   Anyway - this change lgtm. ShouldI merge it?
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 241972)
Time Spent: 12.5h  (was: 12h 20m)
Remaining Estimate: 492.5h  (was: 492h 40m)

> examples in Kotlin
> --
>
> Key: BEAM-6493
> URL: https://issues.apache.org/jira/browse/BEAM-6493
> Project: Beam
>  Issue Type: Task
>  Components: examples-java
>Affects Versions: Not applicable
>Reporter: Harshit Dwivedi
>Assignee: Harshit Dwivedi
>Priority: Minor
>  Labels: documentation
> Fix For: Not applicable
>
>   Original Estimate: 504h
>  Time Spent: 12.5h
>  Remaining Estimate: 492.5h
>
> I have been using Apache Beam for few of my projects in production since the 
> past 6 months and apart from Java, [Kotlin|https://kotlinlang.org/] also 
> seems to work as well with no issues whatsoever.
> But currently, the Github Repository of Apache Beam contains examples only in 
> Java which might be an issue for other developers who want to use Apache Beam 
> SDK with kotlin as there are no sample resources available.
> That said, I would love to go ahead and add kotlin examples alongside the 
> current java examples in the [Beam 
> repository|https://github.com/apache/beam/tree/master/examples/java].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-6493) examples in Kotlin

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-6493?focusedWorklogId=241970=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-241970
 ]

ASF GitHub Bot logged work on BEAM-6493:


Author: ASF GitHub Bot
Created on: 14/May/19 19:19
Start Date: 14/May/19 19:19
Worklog Time Spent: 10m 
  Work Description: pabloem commented on pull request #8439: [BEAM-6493] : 
Add cookbook and snippets examples in Kotlin
URL: https://github.com/apache/beam/pull/8439#discussion_r283959884
 
 

 ##
 File path: 
examples/kotlin/src/main/java/org/apache/beam/examples/kotlin/cookbook/README.md
 ##
 @@ -0,0 +1,71 @@
+
+
+# "Cookbook" Examples
+
+This directory holds simple "cookbook" examples, which show how to define
+commonly-used data analysis patterns that you would likely incorporate into a
+larger Apache Beam pipeline. They include:
+
+ 
+  https://github.com/apache/beam/blob/master/examples/java/src/main/java/org/apache/beam/examples/cookbook/BigQueryTornadoes.java;>BigQueryTornadoes
 
 Review comment:
   That sounds good to me.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 241970)
Time Spent: 12h 20m  (was: 12h 10m)
Remaining Estimate: 492h 40m  (was: 492h 50m)

> examples in Kotlin
> --
>
> Key: BEAM-6493
> URL: https://issues.apache.org/jira/browse/BEAM-6493
> Project: Beam
>  Issue Type: Task
>  Components: examples-java
>Affects Versions: Not applicable
>Reporter: Harshit Dwivedi
>Assignee: Harshit Dwivedi
>Priority: Minor
>  Labels: documentation
> Fix For: Not applicable
>
>   Original Estimate: 504h
>  Time Spent: 12h 20m
>  Remaining Estimate: 492h 40m
>
> I have been using Apache Beam for few of my projects in production since the 
> past 6 months and apart from Java, [Kotlin|https://kotlinlang.org/] also 
> seems to work as well with no issues whatsoever.
> But currently, the Github Repository of Apache Beam contains examples only in 
> Java which might be an issue for other developers who want to use Apache Beam 
> SDK with kotlin as there are no sample resources available.
> That said, I would love to go ahead and add kotlin examples alongside the 
> current java examples in the [Beam 
> repository|https://github.com/apache/beam/tree/master/examples/java].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-7305) Add first version of Hazelcast Jet Runner

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7305?focusedWorklogId=241964=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-241964
 ]

ASF GitHub Bot logged work on BEAM-7305:


Author: ASF GitHub Bot
Created on: 14/May/19 19:10
Start Date: 14/May/19 19:10
Worklog Time Spent: 10m 
  Work Description: mxm commented on issue #8410: [BEAM-7305] Add first 
version of Hazelcast Jet based Java Runner
URL: https://github.com/apache/beam/pull/8410#issuecomment-492370158
 
 
   Thanks @jbartok! Could you assign this JIRA issue to yourself? 
https://jira.apache.org/jira/browse/BEAM-7305 If you do not have permissions, 
please ping me and provide your JIRA username.
   
   You mentioned that you wanted to include the Runner in the release. Please 
open a PR against the `release-2.13.0` branch for that.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 241964)
Time Spent: 20m  (was: 10m)

> Add first version of Hazelcast Jet Runner
> -
>
> Key: BEAM-7305
> URL: https://issues.apache.org/jira/browse/BEAM-7305
> Project: Beam
>  Issue Type: New Feature
>  Components: runner-jet
>Reporter: Maximilian Michels
>Priority: Major
> Fix For: 2.13.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Work logged] (BEAM-7305) Add first version of Hazelcast Jet Runner

2019-05-14 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/BEAM-7305?focusedWorklogId=241962=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-241962
 ]

ASF GitHub Bot logged work on BEAM-7305:


Author: ASF GitHub Bot
Created on: 14/May/19 19:08
Start Date: 14/May/19 19:08
Worklog Time Spent: 10m 
  Work Description: mxm commented on pull request #8410: [BEAM-7305] Add 
first version of Hazelcast Jet based Java Runner
URL: https://github.com/apache/beam/pull/8410
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 241962)
Time Spent: 10m
Remaining Estimate: 0h

> Add first version of Hazelcast Jet Runner
> -
>
> Key: BEAM-7305
> URL: https://issues.apache.org/jira/browse/BEAM-7305
> Project: Beam
>  Issue Type: New Feature
>  Components: runner-jet
>Reporter: Maximilian Michels
>Priority: Major
> Fix For: 2.13.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


  1   2   3   >