[jira] [Commented] (BEAM-570) Update AvroSource to support more compression types

2016-10-04 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15547423#comment-15547423
 ] 

ASF GitHub Bot commented on BEAM-570:
-

GitHub user katsiapis opened a pull request:

https://github.com/apache/incubator-beam/pull/1053

[BEAM-570] Title of the pull request

- Getting rid of CompressionTypes.ZLIB and CompressionTypes.NO_COMPRESSION.
- Introducing BZIP2 compression in analogy to Dataflow Java's BZIP2, 
towards resolution of https://issues.apache.org/jira/browse/BEAM-570.
- Introducing SNAPPY codec support for AVRO conciseness and in order to 
fully resolve https://issues.apache.org/jira/browse/BEAM-570.
- Moving avroio from compression_type to codec as per various discussions (see the usage sketch after this list).
- A few cleanups in avroio.
- Making textio more DRY and doing a few cleanups.
- Raising exceptions when splitting is requested for compressed source 
since that should never happen (guaranteed by the service for the supported 
compression types).
- Using cStringIO instead of StringIO in various places as decided in some 
other discussions.
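
For illustration, a minimal usage sketch of what the avroio surface would look like with a codec argument. The transform name, signature, and schema format here are assumptions drawn from the bullet list above, not verified against the PR:

```
# Minimal sketch, assuming avroio exposes a WriteToAvro transform whose
# 'codec' argument replaces the old compression_type (an assumption from the
# description above; the expected schema format varies by SDK version, and
# codec='snappy' additionally needs the python-snappy package installed).
import apache_beam as beam
from apache_beam.io import avroio

SCHEMA = {
    'type': 'record',
    'name': 'Rec',
    'fields': [{'name': 'x', 'type': 'int'}],
}

with beam.Pipeline() as p:
    _ = (p
         | beam.Create([{'x': 1}, {'x': 2}])
         | avroio.WriteToAvro('/tmp/rec', SCHEMA, codec='deflate'))  # or codec='snappy'
```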

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/katsiapis/incubator-beam bz2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-beam/pull/1053.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1053


commit bd44e76b80e4edf4f922b9a26f7b359c4ede2008
Author: Gus Katsiapis 
Date:   2016-10-05T02:41:07Z

Several enhancements to Dataflow (part 2 of 2).




> Update AvroSource to support more compression types
> ---
>
> Key: BEAM-570
> URL: https://issues.apache.org/jira/browse/BEAM-570
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-py
>Reporter: Chamikara Jayalath
>Assignee: Chamikara Jayalath
>
> Python AvroSource [1] currently only supports 'deflate' compression. We should 
> update it to support other compression types supported by the Avro library 
> (e.g.: snappy, bzip2).
> [1] 
> https://github.com/apache/incubator-beam/blob/python-sdk/sdks/python/apache_beam/io/avroio.py
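
For reference, the codec knob in the underlying Avro Python library looks like this. This is a sketch using the plain avro package rather than Beam; the record schema and file name are illustrative only:

```
# Sketch of the Avro library's own codec support that AvroSource would build on.
# 'snappy' requires the python-snappy package; bzip2 availability depends on the
# avro library version.
import avro.schema
from avro.datafile import DataFileWriter
from avro.io import DatumWriter

schema = avro.schema.parse(  # spelled Parse in some library versions
    '{"type": "record", "name": "Rec",'
    ' "fields": [{"name": "x", "type": "int"}]}')

writer = DataFileWriter(open('/tmp/rec.avro', 'wb'), DatumWriter(), schema,
                        codec='deflate')  # or codec='snappy'
writer.append({'x': 1})
writer.close()
```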



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] incubator-beam pull request #1053: [BEAM-570] Title of the pull request

2016-10-04 Thread katsiapis
GitHub user katsiapis opened a pull request:

https://github.com/apache/incubator-beam/pull/1053

[BEAM-570] Title of the pull request

- Getting rid of CompressionTypes.ZLIB and CompressionTypes.NO_COMPRESSION.
- Introducing BZIP2 compression in analogy to Dataflow Java's BZIP2, 
towards resolution of https://issues.apache.org/jira/browse/BEAM-570.
- Introducing SNAPPY codec support for AVRO conciseness and in order to 
fully resolve https://issues.apache.org/jira/browse/BEAM-570.
- Moving avroio from compression_type to codec as per various discussions.
- A few cleanups in avroio.
- Making textio more DRY and doing a few cleanups.
- Raising exceptions when splitting is requested for compressed source 
since that should never happen (guaranteed by the service for the supported 
compression types).
- Using cStringIO instead of StringIO in various places as decided in some 
other discussions.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/katsiapis/incubator-beam bz2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-beam/pull/1053.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1053


commit bd44e76b80e4edf4f922b9a26f7b359c4ede2008
Author: Gus Katsiapis 
Date:   2016-10-05T02:41:07Z

Several enhancements to Dataflow (part 2 of 2).




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-beam pull request #1052: Delete DatastoreWordCount

2016-10-04 Thread dhalperi
GitHub user dhalperi opened a pull request:

https://github.com/apache/incubator-beam/pull/1052

Delete DatastoreWordCount

Be sure to do all of the following to help us incorporate your contribution
quickly and easily:

 - [ ] Make sure the PR title is formatted like:
   `[BEAM-] Description of pull request`
 - [ ] Make sure tests pass via `mvn clean verify`. (Even better, enable
   Travis-CI on your fork and ensure the whole test matrix passes).
 - [ ] Replace `` in the title with the actual Jira issue
   number, if there is one.
 - [ ] If this contribution is large, please file an Apache
   [Individual Contributor License 
Agreement](https://www.apache.org/licenses/icla.txt).

---

This is the kind of example we do not need to have in Beam. It's just 
WordCount with a different data source.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dhalperi/incubator-beam patch-1

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-beam/pull/1052.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1052


commit d5d3af967008e342530622500f252d7b108f39cf
Author: Daniel Halperin 
Date:   2016-10-05T02:23:54Z

Delete DatastoreWordCount

This is the kind of example we do not need to have in Beam. It's just 
WordCount with a different data source.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-beam pull request #1051: Add RootTransformEvaluatorFactory

2016-10-04 Thread tgroh
GitHub user tgroh opened a pull request:

https://github.com/apache/incubator-beam/pull/1051

Add RootTransformEvaluatorFactory

Be sure to do all of the following to help us incorporate your contribution
quickly and easily:

 - [ ] Make sure the PR title is formatted like:
   `[BEAM-] Description of pull request`
 - [ ] Make sure tests pass via `mvn clean verify`. (Even better, enable
   Travis-CI on your fork and ensure the whole test matrix passes).
 - [ ] Replace `` in the title with the actual Jira issue
   number, if there is one.
 - [ ] If this contribution is large, please file an Apache
   [Individual Contributor License 
Agreement](https://www.apache.org/licenses/icla.txt).

---

Use for Root Transforms.

These transforms generate their own initial inputs, which the Evaluator
is responsible for providing back to them to generate elements from the
root PCollections.

Update ExecutorServiceParallelExecutor to schedule roots based on the
provided transforms.

Some tests that are no longer relevant are deleted in this PR.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/tgroh/incubator-beam initial_inputs

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-beam/pull/1051.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1051


commit 5b0fa8a8599cd0bce5ad60426206f6768b0b0d13
Author: Thomas Groh 
Date:   2016-09-30T23:28:35Z

Add RootTransformEvaluatorFactory

Use for Root Transforms.

These transforms generate their own initial inputs, which the Evaluator
is responsible for providing back to them to generate elements from the
root PCollections.

Update ExecutorServiceParallelExecutor to schedule roots based on the
provided transforms.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-beam pull request #1050: Makes FileBasedSink use a temporary direc...

2016-10-04 Thread jkff
GitHub user jkff opened a pull request:

https://github.com/apache/incubator-beam/pull/1050

Makes FileBasedSink use a temporary directory

When writing to `/path/to/foo`, temporary files would be written to 
`/path/to/foo-temp-$uid` (or something like that), i.e. as siblings of the 
final output. That could lead to issues like 
http://stackoverflow.com/q/39822859/278042

Now, temporary files are written to a path like: 
`/path/to/temp-beam-foo-$date/$uid`. This way, the temporary files won't match 
the same glob as the final output files (even though they may still fail to be 
deleted due to eventual consistency issues).
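
The glob problem is easy to see in isolation. This is an illustration with Python's fnmatch, not Beam code, and the paths are made up:

```
# Illustration only: a consumer's glob for the final output also matches the
# old sibling-style temp files, but not files under a separate temp directory.
import fnmatch

final_glob = '/path/to/foo*'
sibling_temp = '/path/to/foo-temp-1234'                 # old layout: matches
temp_dir_file = '/path/to/temp-beam-foo-20161004/1234'  # new layout: does not match

print(fnmatch.fnmatch(sibling_temp, final_glob))   # True
print(fnmatch.fnmatch(temp_dir_file, final_glob))  # False
```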

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jkff/incubator-beam file-sink-tmp-dir

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-beam/pull/1050.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1050


commit 1c34acdaf4a0c0697c9646934ac163788133347b
Author: Eugene Kirpichov 
Date:   2016-10-04T22:23:27Z

Makes FileBasedSink use a temporary directory

When writing to /path/to/foo, temporary files would be
written to /path/to/foo-temp-$uid (or something like that),
i.e. as siblings of the final output. That could lead
to issues like http://stackoverflow.com/q/39822859/278042

Now, temporary files are written to a path like:
/path/to/temp-beam-foo-$date/$uid.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (BEAM-498) Make DoFnWithContext the new DoFn

2016-10-04 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15546827#comment-15546827
 ] 

ASF GitHub Bot commented on BEAM-498:
-

GitHub user kennknowles opened a pull request:

https://github.com/apache/incubator-beam/pull/1049

[BEAM-498] Add DoFnInvoker for OldDoFn, for migration ease

Be sure to do all of the following to help us incorporate your contribution
quickly and easily:

 - [x] Make sure the PR title is formatted like:
   `[BEAM-] Description of pull request`
 - [x] Make sure tests pass via `mvn clean verify`. (Even better, enable
   Travis-CI on your fork and ensure the whole test matrix passes).
 - [x] Replace `` in the title with the actual Jira issue
   number, if there is one.
 - [x] If this contribution is large, please file an Apache
   [Individual Contributor License 
Agreement](https://www.apache.org/licenses/icla.txt).

---

This allows any runner to use `DoFnInvokers.invokerFor(Object)` and remain 
agnostic as to whether it is running a `DoFn` or an `OldDoFn`. Thus, the 
migration of a runner can occur in advance of further changes to the SDK, and 
deployment can be independent. For example, a backend need not know whether it 
is deserializing a `DoFn` or an `OldDoFn`.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/kennknowles/incubator-beam DoFnInvokers

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-beam/pull/1049.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1049


commit 52d74b9cc15268ad29d5f00a04843fa1aeda1c9e
Author: Kenneth Knowles 
Date:   2016-10-04T20:56:13Z

Add DoFnInvoker for OldDoFn, for migration ease

This allows any runner to use DoFnInvokers.invokerFor(Object) to be
agnostic as to whether they are running a DoFn or OldDoFn. Thus,
the migration of the runner can occur in advance of further changes
to the SDK and deployment can be independent. For example, a backend
need not know whether it is deserializing a DoFn or OldDoFn.




> Make DoFnWithContext the new DoFn
> -
>
> Key: BEAM-498
> URL: https://issues.apache.org/jira/browse/BEAM-498
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-java-core
>Reporter: Kenneth Knowles
>Assignee: Kenneth Knowles
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] incubator-beam pull request #1049: [BEAM-498] Add DoFnInvoker for OldDoFn, f...

2016-10-04 Thread kennknowles
GitHub user kennknowles opened a pull request:

https://github.com/apache/incubator-beam/pull/1049

[BEAM-498] Add DoFnInvoker for OldDoFn, for migration ease

Be sure to do all of the following to help us incorporate your contribution
quickly and easily:

 - [x] Make sure the PR title is formatted like:
   `[BEAM-] Description of pull request`
 - [x] Make sure tests pass via `mvn clean verify`. (Even better, enable
   Travis-CI on your fork and ensure the whole test matrix passes).
 - [x] Replace `` in the title with the actual Jira issue
   number, if there is one.
 - [x] If this contribution is large, please file an Apache
   [Individual Contributor License 
Agreement](https://www.apache.org/licenses/icla.txt).

---

This allows any runner to use `DoFnInvokers.invokerFor(Object)` and remain 
agnostic as to whether it is running a `DoFn` or an `OldDoFn`. Thus, the 
migration of a runner can occur in advance of further changes to the SDK, and 
deployment can be independent. For example, a backend need not know whether it 
is deserializing a `DoFn` or an `OldDoFn`.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/kennknowles/incubator-beam DoFnInvokers

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-beam/pull/1049.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1049


commit 52d74b9cc15268ad29d5f00a04843fa1aeda1c9e
Author: Kenneth Knowles 
Date:   2016-10-04T20:56:13Z

Add DoFnInvoker for OldDoFn, for migration ease

This allows any runner to use DoFnInvokers.invokerFor(Object) to be
agnostic as to whether they are running a DoFn or OldDoFn. Thus,
the migration of the runner can occur in advance of further changes
to the SDK and deployment can be independent. For example, a backend
need not know whether it is deserializing a DoFn or OldDoFn.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-beam pull request #1046: Add equality methods to range source.

2016-10-04 Thread robertwb
Github user robertwb closed the pull request at:

https://github.com/apache/incubator-beam/pull/1046


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[1/2] incubator-beam git commit: Add equality methods to range source.

2016-10-04 Thread robertwb
Repository: incubator-beam
Updated Branches:
  refs/heads/python-sdk 3a69db0c5 -> 99a33ecdb


Add equality methods to range source.


Project: http://git-wip-us.apache.org/repos/asf/incubator-beam/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-beam/commit/837c5aa3
Tree: http://git-wip-us.apache.org/repos/asf/incubator-beam/tree/837c5aa3
Diff: http://git-wip-us.apache.org/repos/asf/incubator-beam/diff/837c5aa3

Branch: refs/heads/python-sdk
Commit: 837c5aa31dd170a9c7e1ba6559e77457bf4a9f7f
Parents: 3a69db0
Author: Robert Bradshaw 
Authored: Tue Oct 4 13:12:08 2016 -0700
Committer: Robert Bradshaw 
Committed: Tue Oct 4 14:35:56 2016 -0700

--
 sdks/python/apache_beam/io/concat_source_test.py | 8 
 1 file changed, 8 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/incubator-beam/blob/837c5aa3/sdks/python/apache_beam/io/concat_source_test.py
--
diff --git a/sdks/python/apache_beam/io/concat_source_test.py b/sdks/python/apache_beam/io/concat_source_test.py
index 828bdb0..e4df472 100644
--- a/sdks/python/apache_beam/io/concat_source_test.py
+++ b/sdks/python/apache_beam/io/concat_source_test.py
@@ -70,6 +70,14 @@ class RangeSource(iobase.BoundedSource):
         return
       yield k
 
+  # For testing
+  def __eq__(self, other):
+    return (type(self) == type(other)
+            and self._start == other._start and self._end == other._end)
+
+  def __ne__(self, other):
+    return not self == other
+
 
 class ConcatSourceTest(unittest.TestCase):
 



[2/2] incubator-beam git commit: Closes #1046

2016-10-04 Thread robertwb
Closes #1046


Project: http://git-wip-us.apache.org/repos/asf/incubator-beam/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-beam/commit/99a33ecd
Tree: http://git-wip-us.apache.org/repos/asf/incubator-beam/tree/99a33ecd
Diff: http://git-wip-us.apache.org/repos/asf/incubator-beam/diff/99a33ecd

Branch: refs/heads/python-sdk
Commit: 99a33ecdb68f66cbf98cdb145794df56ff469081
Parents: 3a69db0 837c5aa
Author: Robert Bradshaw 
Authored: Tue Oct 4 14:35:57 2016 -0700
Committer: Robert Bradshaw 
Committed: Tue Oct 4 14:35:57 2016 -0700

--
 sdks/python/apache_beam/io/concat_source_test.py | 8 
 1 file changed, 8 insertions(+)
--




[jira] [Commented] (BEAM-540) Dataflow streaming jobs running on windmill do not need data disks

2016-10-04 Thread Pei He (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15546638#comment-15546638
 ] 

Pei He commented on BEAM-540:
-

Can this be closed?

> Dataflow streaming jobs running on windmill do not need data disks
> --
>
> Key: BEAM-540
> URL: https://issues.apache.org/jira/browse/BEAM-540
> Project: Beam
>  Issue Type: New Feature
>  Components: runner-dataflow
>Reporter: David Rieber
>Assignee: Davor Bonaci
>
> Dataflow streaming jobs running on windmill do not need data disks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (BEAM-702) Simple pattern for per-bundle and per-DoFn Closeable resources

2016-10-04 Thread Pei He (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pei He updated BEAM-702:

Component/s: beam-model

> Simple pattern for per-bundle and per-DoFn Closeable resources
> --
>
> Key: BEAM-702
> URL: https://issues.apache.org/jira/browse/BEAM-702
> Project: Beam
>  Issue Type: Improvement
>  Components: beam-model
>Reporter: Eugene Kirpichov
>
> Dealing with Closeable resources inside a processElement call is easy: simply 
> use try-with-resources.
> However, bundle- or DoFn-scoped resources, such as long-lived database 
> connections, are less convenient to deal with: you have to open them in 
> startBundle and conditionally close in finishBundle (likewise 
> setup/teardown), taking special care if there's multiple resources to close 
> all of them.
> Perhaps we should provide something like Guava's Closer to DoFn's 
> https://github.com/google/guava/wiki/ClosingResourcesExplained. Ideally, the 
> user would need to only write a startBundle() or setup() method, but not 
> write finishBundle() or teardown() - resources would be closed automatically.
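
A loose analogue of the Closer idea, sketched with Python 3's contextlib.ExitStack rather than the Java API under discussion; the Connection class is a stand-in:

```
# Not Beam code: a Python sketch of "open several resources in one setup step,
# close them all automatically", analogous to the Guava Closer pattern proposed
# above.
import contextlib

class Connection(object):
    """Stand-in for a long-lived resource such as a database connection."""
    def __init__(self, name):
        self.name = name
    def close(self):
        print('closed %s' % self.name)

with contextlib.ExitStack() as stack:
    db = stack.enter_context(contextlib.closing(Connection('db')))
    cache = stack.enter_context(contextlib.closing(Connection('cache')))
    # per-bundle work happens here; both resources are closed on exit,
    # even if an exception is raised part-way through.
```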



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] incubator-beam pull request #1048: Converts KafkaIO builders to @AutoValue

2016-10-04 Thread jkff
GitHub user jkff opened a pull request:

https://github.com/apache/incubator-beam/pull/1048

Converts KafkaIO builders to @AutoValue

This is in the same spirit as 
https://github.com/apache/incubator-beam/pull/1031, 
https://github.com/jbonofre/incubator-beam/pull/1 and 
https://github.com/apache/incubator-beam/pull/1033. The semantics are unchanged 
AFAICT. The only user-visible change is that TypedRead and TypedWrite no longer 
exist (they were unnecessary in the first place); see the trivial changes in 
the tests.

R: @rangadi 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jkff/incubator-beam kafka-autovalue

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-beam/pull/1048.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1048


commit ec3fdd030da1cc2ed2fdd0d16fad0396ef1855a9
Author: Eugene Kirpichov 
Date:   2016-09-29T22:30:10Z

Converts KafkaIO to AutoValue




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-beam pull request #1047: Little fixes to LatestFnTest

2016-10-04 Thread kennknowles
GitHub user kennknowles opened a pull request:

https://github.com/apache/incubator-beam/pull/1047

Little fixes to LatestFnTest

Be sure to do all of the following to help us incorporate your contribution
quickly and easily:

 - [x] Make sure the PR title is formatted like:
   `[BEAM-] Description of pull request`
 - [x] Make sure tests pass via `mvn clean verify`. (Even better, enable
   Travis-CI on your fork and ensure the whole test matrix passes).
 - [x] Replace `` in the title with the actual Jira issue
   number, if there is one.
 - [x] If this contribution is large, please file an Apache
   [Individual Contributor License 
Agreement](https://www.apache.org/licenses/icla.txt).

---

R: @swegner and any committer


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/kennknowles/incubator-beam latest-fn-test

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-beam/pull/1047.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1047


commit 8ae5861036f065e1240e319b7fe2aefe57376f21
Author: Kenneth Knowles 
Date:   2016-10-04T20:23:17Z

De-pluralize error message expectation in LatestFnTests

commit 9e4ba19082ee2d2f7276013e27cf9a5567436a98
Author: Kenneth Knowles 
Date:   2016-10-04T20:23:37Z

Move LatestFnTests to LatestFnTest




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Closed] (BEAM-545) Pipelines and their executions naming changes

2016-10-04 Thread Pei He (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pei He closed BEAM-545.
---
   Resolution: Fixed
Fix Version/s: 0.3.0-incubating

> Pipelines and their executions naming changes
> -
>
> Key: BEAM-545
> URL: https://issues.apache.org/jira/browse/BEAM-545
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-java-core
>Reporter: Pei He
>Assignee: Pei He
>Priority: Minor
> Fix For: 0.3.0-incubating
>
>
> The purpose of the changes is to clarify the differences between the two, have
> consensus between runners, and unify the implementation.
> Current states:
>  * PipelineOptions.appName defaults to mainClass name
>  * DataflowPipelineOptions.jobName defaults to appName+user+datetime
>  * FlinkPipelineOptions.jobName defaults to appName+user+datetime
> Proposal:
> 1. Replace PipelineOptions.appName with PipelineOptions.pipelineName.
> *  It is the user-visible name for a specific graph.
> *  default to mainClass name.
> *  Use cases: Find all executions of a pipeline
> 2. Add jobName to top level PipelineOptions.
> *  It is the unique name for an execution
> *  defaults to pipelineName + user + datetime + random Integer
> *  Use cases:
> -- Finding all executions by USER_A between TIME_X and TIME_Y
> -- Naming resources created by the execution. for example:



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (BEAM-545) Pipelines and their executions naming changes

2016-10-04 Thread Pei He (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pei He updated BEAM-545:

Component/s: sdk-java-core

> Pipelines and their executions naming changes
> -
>
> Key: BEAM-545
> URL: https://issues.apache.org/jira/browse/BEAM-545
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-java-core
>Reporter: Pei He
>Assignee: Pei He
>Priority: Minor
>
> The purpose of the changes is to clarify the differences between the two, have
> consensus between runners, and unify the implementation.
> Current states:
>  * PipelineOptions.appName defaults to mainClass name
>  * DataflowPipelineOptions.jobName defaults to appName+user+datetime
>  * FlinkPipelineOptions.jobName defaults to appName+user+datetime
> Proposal:
> 1. Replace PipelineOptions.appName with PipelineOptions.pipelineName.
> *  It is the user-visible name for a specific graph.
> *  default to mainClass name.
> *  Use cases: Find all executions of a pipeline
> 2. Add jobName to top level PipelineOptions.
> *  It is the unique name for an execution
> *  defaults to pipelineName + user + datetime + random Integer
> *  Use cases:
> -- Finding all executions by USER_A between TIME_X and TIME_Y
> -- Naming resources created by the execution. for example:



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Issue Comment Deleted] (BEAM-696) Side-Inputs non-deterministic with merging main-input windows

2016-10-04 Thread Amit Sela (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amit Sela updated BEAM-696:
---
Comment: was deleted

(was: So it runs once, on the merged window ?
That happens in the bundle level, correct ? Do bundles always behave the same ? 
in terms of #elements they hold on to ?
If not, and a side input is called on the merged windows, won't the sideInput 
value be affected from things like network or something else that may affect 
the bundle's contents ? and if the bundle always holds just one element, then 
merging never happens, correct ?

I find this a bit confusing, I think the problem here has to do with the fact 
that applying a CombineFn with SideInputs on any PCollection is problematic. 
Sessions seem to be handled as any other BoundedWindow for that matter, but 
they are not..
BTW, isn't this: 
https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/windowing/Sessions.java#L84
 saying that sideInputs are not allowed on Sessions ? Fact is that they are 
allowed, but what does that mean ?  )

> Side-Inputs non-deterministic with merging main-input windows
> -
>
> Key: BEAM-696
> URL: https://issues.apache.org/jira/browse/BEAM-696
> Project: Beam
>  Issue Type: Bug
>  Components: beam-model
>Reporter: Ben Chambers
>Assignee: Pei He
>
> Side-Inputs are non-deterministic for several reasons:
> 1. Because they depend on triggering of the side-input (this is acceptable 
> because triggers are by their nature non-deterministic).
> 2. They depend on the current state of the main-input window in order to 
> lookup the side-input. This means that with merging
> 3. Any runner optimizations that affect when the side-input is looked up may 
> cause problems with either or both of these.
> This issue focuses on #2 -- the non-determinism of side-inputs that execute 
> within a Merging WindowFn.
> Possible solution would be to defer running anything that looks up the 
> side-input until we need to extract an output, and using the main-window at 
> that point. Specifically, if the main-window is a MergingWindowFn, don't 
> execute any kind of pre-combine, instead buffer all the inputs and combine 
> later.
> This could still run into some non-determinism if there are triggers 
> controlling when we extract output.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (BEAM-433) Make Beam examples runners agnostic

2016-10-04 Thread Pei He (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pei He closed BEAM-433.
---
   Resolution: Fixed
Fix Version/s: 0.3.0-incubating

> Make Beam examples runners agnostic
> ---
>
> Key: BEAM-433
> URL: https://issues.apache.org/jira/browse/BEAM-433
> Project: Beam
>  Issue Type: Improvement
>  Components: examples-java
>Reporter: Pei He
>Assignee: Pei He
> Fix For: 0.3.0-incubating
>
>
> Beam examples are ported from Dataflow, and they heavily reference 
> Dataflow classes.
> There are following cleanup tasks:
> 1. Remove Dataflow streaming and batch injector setup (Done).
> 2. Remove references to DataflowPipelineOptions.
> 3. Move cancel() from DataflowPipelineJob to PipelineResult.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (BEAM-696) Side-Inputs non-deterministic with merging main-input windows

2016-10-04 Thread Amit Sela (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15546539#comment-15546539
 ] 

Amit Sela commented on BEAM-696:


So it runs once, on the merged window ?
That happens in the bundle level, correct ? Do bundles always behave the same ? 
in terms of #elements they hold on to ?
If not, and a side input is called on the merged windows, won't the sideInput 
value be affected from things like network or something else that may affect 
the bundle's contents ? and if the bundle always holds just one element, then 
merging never happens, correct ?

I find this a bit confusing, I think the problem here has to do with the fact 
that applying a CombineFn with SideInputs on any PCollection is problematic. 
Sessions seem to be handled as any other BoundedWindow for that matter, but 
they are not..
BTW, isn't this: 
https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/windowing/Sessions.java#L84
 saying that sideInputs are not allowed on Sessions ? Fact is that they are 
allowed, but what does that mean ?  

> Side-Inputs non-deterministic with merging main-input windows
> -
>
> Key: BEAM-696
> URL: https://issues.apache.org/jira/browse/BEAM-696
> Project: Beam
>  Issue Type: Bug
>  Components: beam-model
>Reporter: Ben Chambers
>Assignee: Pei He
>
> Side-Inputs are non-deterministic for several reasons:
> 1. Because they depend on triggering of the side-input (this is acceptable 
> because triggers are by their nature non-deterministic).
> 2. They depend on the current state of the main-input window in order to 
> lookup the side-input. This means that with merging
> 3. Any runner optimizations that affect when the side-input is looked up may 
> cause problems with either or both of these.
> This issue focuses on #2 -- the non-determinism of side-inputs that execute 
> within a Merging WindowFn.
> Possible solution would be to defer running anything that looks up the 
> side-input until we need to extract an output, and using the main-window at 
> that point. Specifically, if the main-window is a MergingWindowFn, don't 
> execute any kind of pre-combine, instead buffer all the inputs and combine 
> later.
> This could still run into some non-determinism if there are triggers 
> controlling when we extract output.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (BEAM-696) Side-Inputs non-deterministic with merging main-input windows

2016-10-04 Thread Amit Sela (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15546538#comment-15546538
 ] 

Amit Sela commented on BEAM-696:


So it runs once, on the merged window ?
That happens in the bundle level, correct ? Do bundles always behave the same ? 
in terms of #elements they hold on to ?
If not, and a side input is called on the merged windows, won't the sideInput 
value be affected from things like network or something else that may affect 
the bundle's contents ? and if the bundle always holds just one element, then 
merging never happens, correct ?

I find this a bit confusing, I think the problem here has to do with the fact 
that applying a CombineFn with SideInputs on any PCollection is problematic. 
Sessions seem to be handled as any other BoundedWindow for that matter, but 
they are not..
BTW, isn't this: 
https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/windowing/Sessions.java#L84
 saying that sideInputs are not allowed on Sessions ? Fact is that they are 
allowed, but what does that mean ?  

> Side-Inputs non-deterministic with merging main-input windows
> -
>
> Key: BEAM-696
> URL: https://issues.apache.org/jira/browse/BEAM-696
> Project: Beam
>  Issue Type: Bug
>  Components: beam-model
>Reporter: Ben Chambers
>Assignee: Pei He
>
> Side-Inputs are non-deterministic for several reasons:
> 1. Because they depend on triggering of the side-input (this is acceptable 
> because triggers are by their nature non-deterministic).
> 2. They depend on the current state of the main-input window in order to 
> lookup the side-input. This means that with merging
> 3. Any runner optimizations that affect when the side-input is looked up may 
> cause problems with either or both of these.
> This issue focuses on #2 -- the non-determinism of side-inputs that execute 
> within a Merging WindowFn.
> Possible solution would be to defer running anything that looks up the 
> side-input until we need to extract an output, and using the main-window at 
> that point. Specifically, if the main-window is a MergingWindowFn, don't 
> execute any kind of pre-combine, instead buffer all the inputs and combine 
> later.
> This could still run into some non-determinism if there are triggers 
> controlling when we extract output.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (BEAM-483) Generated job names easily collide

2016-10-04 Thread Pei He (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pei He closed BEAM-483.
---
   Resolution: Fixed
Fix Version/s: 0.3.0-incubating

> Generated job names easily collide
> --
>
> Key: BEAM-483
> URL: https://issues.apache.org/jira/browse/BEAM-483
> Project: Beam
>  Issue Type: Bug
>  Components: runner-dataflow
>Reporter: Pei He
>Assignee: Pei He
>Priority: Minor
> Fix For: 0.3.0-incubating
>
>
> The current job name generation scheme may easily lead to duplicate job names 
> and cause DataflowJobAlreadyExistsException, especially when a series of jobs 
> are submitted at the same time (e.g., from a single script).
> It would be better to just add a random suffix like "-3275" or "-x1bh".
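
A sketch of the kind of name generation being proposed here and in BEAM-545 (illustrative Python, not the runner's actual code):

```
# Illustrative only: pipeline name + user + timestamp + random suffix, so that
# several submissions from one script get distinct job names.
import datetime
import getpass
import random
import string

def generate_job_name(pipeline_name):
    stamp = datetime.datetime.utcnow().strftime('%m%d%H%M%S')
    suffix = ''.join(random.choice(string.ascii_lowercase + string.digits)
                     for _ in range(4))
    return '-'.join([pipeline_name.lower(), getpass.getuser(), stamp, suffix])

print(generate_job_name('wordcount'))  # e.g. wordcount-alice-1004213055-x1bh
```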



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (BEAM-554) Dataflow runner to support bounded writes in streaming mode.

2016-10-04 Thread Pei He (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pei He resolved BEAM-554.
-
   Resolution: Fixed
Fix Version/s: 0.3.0-incubating

> Dataflow runner to support bounded writes in streaming mode.
> 
>
> Key: BEAM-554
> URL: https://issues.apache.org/jira/browse/BEAM-554
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-dataflow
>Reporter: Pei He
>Assignee: Pei He
> Fix For: 0.3.0-incubating
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (BEAM-572) Remove Spark references in WordCount

2016-10-04 Thread Pei He (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pei He resolved BEAM-572.
-
   Resolution: Fixed
Fix Version/s: 0.3.0-incubating

> Remove Spark references in WordCount
> 
>
> Key: BEAM-572
> URL: https://issues.apache.org/jira/browse/BEAM-572
> Project: Beam
>  Issue Type: Bug
>  Components: examples-java
>Reporter: Pei He
>Assignee: Mark Liu
> Fix For: 0.3.0-incubating
>
>
> Examples should be runner agnostic.
> We don't want to have Spark references in
> https://github.com/apache/incubator-beam/blob/master/examples/java/src/main/java/org/apache/beam/examples/WordCount.java



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] incubator-beam pull request #1046: Add equality methods to range source.

2016-10-04 Thread robertwb
GitHub user robertwb opened a pull request:

https://github.com/apache/incubator-beam/pull/1046

Add equality methods to range source.

Be sure to do all of the following to help us incorporate your contribution
quickly and easily:

 - [ ] Make sure the PR title is formatted like:
   `[BEAM-] Description of pull request`
 - [ ] Make sure tests pass via `mvn clean verify`. (Even better, enable
   Travis-CI on your fork and ensure the whole test matrix passes).
 - [ ] Replace `` in the title with the actual Jira issue
   number, if there is one.
 - [ ] If this contribution is large, please file an Apache
   [Individual Contributor License 
Agreement](https://www.apache.org/licenses/icla.txt).

---



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/robertwb/incubator-beam patch-2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-beam/pull/1046.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1046


commit 602f269bc9b6c198d9c79a006cca70528e696153
Author: Robert Bradshaw 
Date:   2016-10-04T20:12:08Z

Add equality methods to range source.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (BEAM-586) CombineFns with no inputs should produce no outputs when used as a main input

2016-10-04 Thread Pei He (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15546500#comment-15546500
 ] 

Pei He commented on BEAM-586:
-

Is it fair to say that triggering defines whether and when CombineFns output, and 
CombineFns define what to output given a collection of inputs (which could be empty)?

If this is what we want to do (I am not sure it is right), possible options 
are:
1. Triggering doesn't output an empty pane.
or 2. CombineFns output nothing for an empty input collection.

I would prefer not to differentiate main input and side input.
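
The two behaviours being distinguished can be seen side by side in the Python SDK's global combine (an illustration in Python terms, although the thread is about the Java model):

```
# Illustration only: a global combine over an empty PCollection emits a default
# value unless defaults are suppressed -- the distinction between the two
# options above.
import apache_beam as beam

with beam.Pipeline() as p:
    empty = (p
             | beam.Create([1, 2, 3])
             | 'DropAll' >> beam.Filter(lambda _: False))   # an empty PCollection
    with_default = empty | 'Default' >> beam.CombineGlobally(sum)  # emits 0
    no_output = (empty
                 | 'NoDefault' >> beam.CombineGlobally(sum).without_defaults())  # emits nothing
```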

> CombineFns with no inputs should produce no outputs when used as a main input
> -
>
> Key: BEAM-586
> URL: https://issues.apache.org/jira/browse/BEAM-586
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Reporter: Thomas Groh
>
> TestGloballyEmptyCollection seems to be violating this assumption
> https://github.com/apache/incubator-beam/pull/862/files#diff-a305819710e8d79d2b045d6416184f21R65
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (BEAM-561) Add WindowedWordCountIT

2016-10-04 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15546489#comment-15546489
 ] 

ASF GitHub Bot commented on BEAM-561:
-

GitHub user markflyhigh opened a pull request:

https://github.com/apache/incubator-beam/pull/1045

[BEAM-561] Add Streaming IT to Jenkins Pre-commit

Be sure to do all of the following to help us incorporate your contribution
quickly and easily:

 - [x] Make sure the PR title is formatted like:
   `[BEAM-] Description of pull request`
 - [x] Make sure tests pass via `mvn clean verify`. (Even better, enable
   Travis-CI on your fork and ensure the whole test matrix passes).
 - [x] Replace `` in the title with the actual Jira issue
   number, if there is one.
 - [ ] If this contribution is large, please file an Apache
   [Individual Contributor License 
Agreement](https://www.apache.org/licenses/icla.txt).

---

Jenkins can add this profile to precommit, and WindowedWordCountIT will run 
in both batch and streaming mode in each build. This change increases the 
coverage of streaming pipelines as well as BigQuery. Currently, only 
DirectRunner and TestDataflowRunner are used; TestSparkRunner and 
TestFlinkRunner will be added later once they fully support streaming 
integration tests.

Testing is done by running the following command with different configs:
```
mvn clean verify -P jenkins-precommit-streaming,include-runners
```

TODO:
The Jenkins pre-commit command needs to be updated by adding the 
"jenkins-precommit-streaming" flag.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/markflyhigh/incubator-beam jenkins-precommit-streaming

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-beam/pull/1045.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1045


commit 3cf06363355d9d35bf3834ef97c4b0ee6b2f97a5
Author: Mark Liu 
Date:   2016-10-04T19:43:46Z

[BEAM-561] Add Streaming IT in Jenkins Pre-commit




> Add WindowedWordCountIT
> ---
>
> Key: BEAM-561
> URL: https://issues.apache.org/jira/browse/BEAM-561
> Project: Beam
>  Issue Type: Bug
>Reporter: Jason Kuster
>Assignee: Mark Liu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] incubator-beam pull request #1045: [BEAM-561] Add Streaming IT to Jenkins Pr...

2016-10-04 Thread markflyhigh
GitHub user markflyhigh opened a pull request:

https://github.com/apache/incubator-beam/pull/1045

[BEAM-561] Add Streaming IT to Jenkins Pre-commit

Be sure to do all of the following to help us incorporate your contribution
quickly and easily:

 - [x] Make sure the PR title is formatted like:
   `[BEAM-] Description of pull request`
 - [x] Make sure tests pass via `mvn clean verify`. (Even better, enable
   Travis-CI on your fork and ensure the whole test matrix passes).
 - [x] Replace `` in the title with the actual Jira issue
   number, if there is one.
 - [ ] If this contribution is large, please file an Apache
   [Individual Contributor License 
Agreement](https://www.apache.org/licenses/icla.txt).

---

Jenkins can add this profile to precommit, and WindowedWordCountIT will run 
in both batch and streaming mode in each build. This change increases the 
coverage of streaming pipelines as well as BigQuery. Currently, only 
DirectRunner and TestDataflowRunner are used; TestSparkRunner and 
TestFlinkRunner will be added later once they fully support streaming 
integration tests.

Testing is done by running the following command with different configs:
```
mvn clean verify -P jenkins-precommit-streaming,include-runners
```

TODO:
The Jenkins pre-commit command needs to be updated by adding the 
"jenkins-precommit-streaming" flag.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/markflyhigh/incubator-beam jenkins-precommit-streaming

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-beam/pull/1045.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1045


commit 3cf06363355d9d35bf3834ef97c4b0ee6b2f97a5
Author: Mark Liu 
Date:   2016-10-04T19:43:46Z

[BEAM-561] Add Streaming IT in Jenkins Pre-commit




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (BEAM-614) File compression/decompression should support auto detection

2016-10-04 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15546481#comment-15546481
 ] 

ASF GitHub Bot commented on BEAM-614:
-

Github user asfgit closed the pull request at:

https://github.com/apache/incubator-beam/pull/1002


> File compression/decompression should support auto detection
> 
>
> Key: BEAM-614
> URL: https://issues.apache.org/jira/browse/BEAM-614
> Project: Beam
>  Issue Type: Bug
>Reporter: Slaven Bilac
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] incubator-beam pull request #1002: [BEAM-614] Updates FileBasedSource to sup...

2016-10-04 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/incubator-beam/pull/1002


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[2/2] incubator-beam git commit: Closes #1002

2016-10-04 Thread robertwb
Closes #1002


Project: http://git-wip-us.apache.org/repos/asf/incubator-beam/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-beam/commit/3a69db0c
Tree: http://git-wip-us.apache.org/repos/asf/incubator-beam/tree/3a69db0c
Diff: http://git-wip-us.apache.org/repos/asf/incubator-beam/diff/3a69db0c

Branch: refs/heads/python-sdk
Commit: 3a69db0c5bb9321c2082831b5c00d778ddf1b1d7
Parents: 731a771 2126a34
Author: Robert Bradshaw 
Authored: Tue Oct 4 13:00:21 2016 -0700
Committer: Robert Bradshaw 
Committed: Tue Oct 4 13:00:21 2016 -0700

--
 sdks/python/apache_beam/io/filebasedsource.py   |  44 
 .../apache_beam/io/filebasedsource_test.py  | 104 +--
 2 files changed, 121 insertions(+), 27 deletions(-)
--




[1/2] incubator-beam git commit: Updates filebasedsource to support CompressionType.AUTO.

2016-10-04 Thread robertwb
Repository: incubator-beam
Updated Branches:
  refs/heads/python-sdk 731a77152 -> 3a69db0c5


Updates filebasedsource to support CompressionType.AUTO.


Project: http://git-wip-us.apache.org/repos/asf/incubator-beam/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-beam/commit/2126a34c
Tree: http://git-wip-us.apache.org/repos/asf/incubator-beam/tree/2126a34c
Diff: http://git-wip-us.apache.org/repos/asf/incubator-beam/diff/2126a34c

Branch: refs/heads/python-sdk
Commit: 2126a34c06d74c5ad44fbec8dd4e278f99ed473a
Parents: 731a771
Author: chamik...@google.com 
Authored: Sun Sep 25 21:44:34 2016 -0700
Committer: chamik...@google.com 
Committed: Mon Oct 3 10:20:46 2016 -0700

--
 sdks/python/apache_beam/io/filebasedsource.py   |  44 
 .../apache_beam/io/filebasedsource_test.py  | 104 +--
 2 files changed, 121 insertions(+), 27 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/incubator-beam/blob/2126a34c/sdks/python/apache_beam/io/filebasedsource.py
--
diff --git a/sdks/python/apache_beam/io/filebasedsource.py b/sdks/python/apache_beam/io/filebasedsource.py
index 8ff69ca..e067833 100644
--- a/sdks/python/apache_beam/io/filebasedsource.py
+++ b/sdks/python/apache_beam/io/filebasedsource.py
@@ -42,8 +42,7 @@ class FileBasedSource(iobase.BoundedSource):
   def __init__(self,
                file_pattern,
                min_bundle_size=0,
-               # TODO(BEAM-614)
-               compression_type=fileio.CompressionTypes.UNCOMPRESSED,
+               compression_type=fileio.CompressionTypes.AUTO,
                splittable=True):
     """Initializes ``FileBasedSource``.
 
@@ -72,13 +71,6 @@ class FileBasedSource(iobase.BoundedSource):
           '%s: file_pattern must be a string;  got %r instead' %
           (self.__class__.__name__, file_pattern))
 
-    if compression_type == fileio.CompressionTypes.AUTO:
-      raise ValueError('FileBasedSource currently does not support '
-                       'CompressionTypes.AUTO. Please explicitly specify the '
-                       'compression type or use '
-                       'CompressionTypes.UNCOMPRESSED if file is '
-                       'uncompressed.')
-
     self._pattern = file_pattern
     self._concat_source = None
     self._min_bundle_size = min_bundle_size
@@ -86,11 +78,12 @@ class FileBasedSource(iobase.BoundedSource):
       raise TypeError('compression_type must be CompressionType object but '
                       'was %s' % type(compression_type))
     self._compression_type = compression_type
-    if compression_type != fileio.CompressionTypes.UNCOMPRESSED:
+    if compression_type in (fileio.CompressionTypes.UNCOMPRESSED,
+                            fileio.CompressionTypes.AUTO):
+      self._splittable = splittable
+    else:
       # We can't split compressed files efficiently so turn off splitting.
       self._splittable = False
-    else:
-      self._splittable = splittable
 
   def _get_concat_source(self):
     if self._concat_source is None:
@@ -102,11 +95,21 @@ class FileBasedSource(iobase.BoundedSource):
         if sizes[index] == 0:
           continue  # Ignoring empty file.
 
+        # We determine splittability of this specific file.
+        splittable = self.splittable
+        if (splittable and
+            self._compression_type == fileio.CompressionTypes.AUTO):
+          compression_type = fileio.CompressionTypes.detect_compression_type(
+              file_name)
+          if compression_type != fileio.CompressionTypes.UNCOMPRESSED:
+            splittable = False
+
         single_file_source = _SingleFileSource(
             self, file_name,
             0,
             sizes[index],
-            min_bundle_size=self._min_bundle_size)
+            min_bundle_size=self._min_bundle_size,
+            splittable=splittable)
         single_file_sources.append(single_file_source)
       self._concat_source = concat_source.ConcatSource(single_file_sources)
     return self._concat_source
@@ -173,7 +176,7 @@ class _SingleFileSource(iobase.BoundedSource):
   """Denotes a source for a specific file type."""
 
   def __init__(self, file_based_source, file_name, start_offset, stop_offset,
-               min_bundle_size=0):
+               min_bundle_size=0, splittable=True):
     if not isinstance(start_offset, (int, long)):
       raise TypeError(
           'start_offset must be a number. Received: %r' % start_offset)
@@ -193,6 +196,7 @@ class _SingleFileSource(iobase.BoundedSource):
     self._stop_offset = stop_offset
     self._min_bundle_size = min_bundle_size
     self._file_based_source = file_based_source
+    self._splittable = splittable
 
   def split(self, desired_bundle_size, start_offset=None,

[jira] [Commented] (BEAM-696) Side-Inputs non-deterministic with merging main-input windows

2016-10-04 Thread Kenneth Knowles (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15546473#comment-15546473
 ] 

Kenneth Knowles commented on BEAM-696:
--

Ben points out (in person) that this latter spec is already the spec, because 
Combine is not a model primitive transform: Combine.perKey expands to 
"GroupByKey and then combine the groups with ParDo over the iterable". So 
semantically, it occurs in a single call to {{processElement}}, over an 
iterable that is in a single main input window, and reads the side input at 
just one point in its evolution.
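
In Python-SDK terms, a sketch of the semantic expansion described here (not runner code):

```
# Sketch of the expansion described above: Combine.perKey is semantically a
# GroupByKey followed by a single ParDo call that combines each group's
# iterable, reading any side input once at that point.
import apache_beam as beam

with beam.Pipeline() as p:
    kvs = p | beam.Create([('a', 1), ('a', 2), ('b', 3)])

    expanded = (kvs
                | 'Group' >> beam.GroupByKey()
                | 'CombineGroups' >> beam.Map(lambda kv: (kv[0], sum(kv[1]))))

    composite = kvs | 'CombinePerKey' >> beam.CombinePerKey(sum)  # the equivalent composite
```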

> Side-Inputs non-deterministic with merging main-input windows
> -
>
> Key: BEAM-696
> URL: https://issues.apache.org/jira/browse/BEAM-696
> Project: Beam
>  Issue Type: Bug
>  Components: beam-model
>Reporter: Ben Chambers
>Assignee: Pei He
>
> Side-Inputs are non-deterministic for several reasons:
> 1. Because they depend on triggering of the side-input (this is acceptable 
> because triggers are by their nature non-deterministic).
> 2. They depend on the current state of the main-input window in order to 
> lookup the side-input. This means that with merging
> 3. Any runner optimizations that affect when the side-input is looked up may 
> cause problems with either or both of these.
> This issue focuses on #2 -- the non-determinism of side-inputs that execute 
> within a Merging WindowFn.
> Possible solution would be to defer running anything that looks up the 
> side-input until we need to extract an output, and using the main-window at 
> that point. Specifically, if the main-window is a MergingWindowFn, don't 
> execute any kind of pre-combine, instead buffer all the inputs and combine 
> later.
> This could still run into some non-determinism if there are triggers 
> controlling when we extract output.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (BEAM-605) Create BigQuery Verifier

2016-10-04 Thread Pei He (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pei He updated BEAM-605:

Affects Version/s: Not applicable
  Component/s: testing

> Create BigQuery Verifier
> 
>
> Key: BEAM-605
> URL: https://issues.apache.org/jira/browse/BEAM-605
> Project: Beam
>  Issue Type: Task
>  Components: testing
>Affects Versions: Not applicable
>Reporter: Mark Liu
>Assignee: Mark Liu
>
> Create a BigQuery verifier that is used to verify the output of integration tests 
> that use BigQuery as an output source. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Issue Comment Deleted] (BEAM-702) Simple pattern for per-bundle and per-DoFn Closeable resources

2016-10-04 Thread Pei He (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pei He updated BEAM-702:

Comment: was deleted

(was: 1. If there are failures in one bundle, we can easily call teardown()s of 
DoFns associated with this bundle. (I think we had plan to do it, I am not sure 
it has been done or not.)
The tricky part is that if there are failures in other PTransforms or in the 
environment while we are processing the bundles. Before runners fail the whole 
job, we want runners be able to call teardown() for all running ParDos and 
their bundles in processing. We don't have a plan for this part yet.

2. We should document and clarify whether finishBundle() will be called or not 
if the bundle fails.)

> Simple pattern for per-bundle and per-DoFn Closeable resources
> --
>
> Key: BEAM-702
> URL: https://issues.apache.org/jira/browse/BEAM-702
> Project: Beam
>  Issue Type: Improvement
>Reporter: Eugene Kirpichov
>
> Dealing with Closeable resources inside a processElement call is easy: simply 
> use try-with-resources.
> However, bundle- or DoFn-scoped resources, such as long-lived database 
> connections, are less convenient to deal with: you have to open them in 
> startBundle and conditionally close in finishBundle (likewise 
> setup/teardown), taking special care if there are multiple resources to close 
> all of them.
> Perhaps we should provide something like Guava's Closer to DoFn's 
> https://github.com/google/guava/wiki/ClosingResourcesExplained. Ideally, the 
> user would need to only write a startBundle() or setup() method, but not 
> write finishBundle() or teardown() - resources would be closed automatically.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (BEAM-702) Simple pattern for per-bundle and per-DoFn Closeable resources

2016-10-04 Thread Pei He (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15546412#comment-15546412
 ] 

Pei He commented on BEAM-702:
-

1. If there are failures in one bundle, we can easily call the teardown()s of DoFns 
associated with this bundle. (I think we had a plan to do it; I am not sure whether it 
has been done or not.)
The tricky part is when there are failures in other PTransforms or in the 
environment while we are processing the bundles. Before runners fail the whole 
job, we want runners to be able to call teardown() for all running ParDos and 
their bundles in processing. We don't have a plan for this part yet.

2. We should document and clarify whether finishBundle() will be called or not 
if the bundle fails.

> Simple pattern for per-bundle and per-DoFn Closeable resources
> --
>
> Key: BEAM-702
> URL: https://issues.apache.org/jira/browse/BEAM-702
> Project: Beam
>  Issue Type: Improvement
>Reporter: Eugene Kirpichov
>
> Dealing with Closeable resources inside a processElement call is easy: simply 
> use try-with-resources.
> However, bundle- or DoFn-scoped resources, such as long-lived database 
> connections, are less convenient to deal with: you have to open them in 
> startBundle and conditionally close them in finishBundle (likewise 
> setup/teardown), taking special care, when there are multiple resources, to 
> close all of them.
> Perhaps we should provide something like Guava's Closer 
> (https://github.com/google/guava/wiki/ClosingResourcesExplained) to DoFn. 
> Ideally, the user would only need to write a startBundle() or setup() method, 
> but not finishBundle() or teardown() - resources would be closed automatically.
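
For concreteness, the manual pattern described above looks roughly like the
sketch below. DbClient and MetricsReporter are hypothetical Closeable resources
(not Beam or real library classes), and the Context-based @StartBundle and
@FinishBundle signatures follow the style used elsewhere in this thread.

import java.io.IOException;
import org.apache.beam.sdk.transforms.DoFn;

class TwoResourceFn extends DoFn<String, String> {
  private transient DbClient db;              // hypothetical Closeable resource
  private transient MetricsReporter metrics;  // hypothetical Closeable resource

  @StartBundle
  public void startBundle(Context c) {
    db = DbClient.open();
    metrics = MetricsReporter.open();
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    c.output(db.lookup(c.element()));  // lookup() is a hypothetical helper
  }

  @FinishBundle
  public void finishBundle(Context c) throws IOException {
    // Conditional, order-aware cleanup: a failure closing db must not skip metrics.
    try {
      if (db != null) {
        db.close();
      }
    } finally {
      if (metrics != null) {
        metrics.close();
      }
    }
  }
}

With more than two resources this nesting grows quickly, which is the
boilerplate a Closer-style helper would remove.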



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (BEAM-703) SingletonViewFn might exhaust defaultValue if it's serialized after being used.

2016-10-04 Thread Kenneth Knowles (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kenneth Knowles resolved BEAM-703.
--
   Resolution: Fixed
Fix Version/s: 0.3.0-incubating

> SingletonViewFn might exhaust defaultValue if it's serialized after being 
> used. 
> 
>
> Key: BEAM-703
> URL: https://issues.apache.org/jira/browse/BEAM-703
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Reporter: Amit Sela
>Assignee: Amit Sela
>Priority: Minor
> Fix For: 0.3.0-incubating
>
>
> In 
> https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/util/PCollectionViews.java#L267
>  the defaultValue is set to null to avoid decoding over and over, I assume.
> If the defaultValue is accessed before the SingletonViewFn is serialized, the 
> encoded value is consumed (set to null after decoding), and serialization then 
> loses the transient decoded value.
> It'd probably be best to simply check whether defaultValue is null before 
> decoding, so that decoding still happens just once but the encoded data is 
> not lost.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] incubator-beam pull request #1040: [BEAM-703] SingletonViewFn might exhaust ...

2016-10-04 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/incubator-beam/pull/1040


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (BEAM-703) SingletonViewFn might exhaust defaultValue if it's serialized after being used.

2016-10-04 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15546394#comment-15546394
 ] 

ASF GitHub Bot commented on BEAM-703:
-

Github user asfgit closed the pull request at:

https://github.com/apache/incubator-beam/pull/1040


> SingletonViewFn might exhaust defaultValue if it's serialized after being 
> used. 
> 
>
> Key: BEAM-703
> URL: https://issues.apache.org/jira/browse/BEAM-703
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Reporter: Amit Sela
>Assignee: Amit Sela
>Priority: Minor
>
> In 
> https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/util/PCollectionViews.java#L267
>  the defaultValue is set to null to avoid decoding over and over, I assume.
> If the defaultValue is accessed before the SingletonViewFn is serialized, the 
> encoded value is consumed (set to null after decoding), and serialization then 
> loses the transient decoded value.
> It'd probably be best to simply check whether defaultValue is null before 
> decoding, so that decoding still happens just once but the encoded data is 
> not lost.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[1/2] incubator-beam git commit: Avoid losing the encoded defaultValue.

2016-10-04 Thread kenn
Repository: incubator-beam
Updated Branches:
  refs/heads/master 8462acbcb -> 087dcef1e


Avoid losing the encoded defaultValue.


Project: http://git-wip-us.apache.org/repos/asf/incubator-beam/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-beam/commit/0bd97efc
Tree: http://git-wip-us.apache.org/repos/asf/incubator-beam/tree/0bd97efc
Diff: http://git-wip-us.apache.org/repos/asf/incubator-beam/diff/0bd97efc

Branch: refs/heads/master
Commit: 0bd97efc0d764af17cdd8abdf43bff33bb21be2b
Parents: 8462acb
Author: Sela 
Authored: Tue Oct 4 18:12:38 2016 +0300
Committer: Sela 
Committed: Tue Oct 4 18:12:38 2016 +0300

--
 .../src/main/java/org/apache/beam/sdk/util/PCollectionViews.java  | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/incubator-beam/blob/0bd97efc/sdks/java/core/src/main/java/org/apache/beam/sdk/util/PCollectionViews.java
--
diff --git 
a/sdks/java/core/src/main/java/org/apache/beam/sdk/util/PCollectionViews.java 
b/sdks/java/core/src/main/java/org/apache/beam/sdk/util/PCollectionViews.java
index 14ae5c8..3b1fde9 100644
--- 
a/sdks/java/core/src/main/java/org/apache/beam/sdk/util/PCollectionViews.java
+++ 
b/sdks/java/core/src/main/java/org/apache/beam/sdk/util/PCollectionViews.java
@@ -261,10 +261,9 @@ public class PCollectionViews {
       }
       // Lazily decode the default value once
       synchronized (this) {
-        if (encodedDefaultValue != null) {
+        if (encodedDefaultValue != null && defaultValue == null) {
           try {
             defaultValue = CoderUtils.decodeFromByteArray(valueCoder, encodedDefaultValue);
-            encodedDefaultValue = null;
           } catch (IOException e) {
             throw new RuntimeException("Unexpected IOException: ", e);
           }



[2/2] incubator-beam git commit: This closes #1040

2016-10-04 Thread kenn
This closes #1040


Project: http://git-wip-us.apache.org/repos/asf/incubator-beam/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-beam/commit/087dcef1
Tree: http://git-wip-us.apache.org/repos/asf/incubator-beam/tree/087dcef1
Diff: http://git-wip-us.apache.org/repos/asf/incubator-beam/diff/087dcef1

Branch: refs/heads/master
Commit: 087dcef1efc6b7765e203f8aebfd3caf8c4a77de
Parents: 8462acb 0bd97ef
Author: Kenneth Knowles 
Authored: Tue Oct 4 12:27:22 2016 -0700
Committer: Kenneth Knowles 
Committed: Tue Oct 4 12:27:22 2016 -0700

--
 .../src/main/java/org/apache/beam/sdk/util/PCollectionViews.java  | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)
--




[jira] [Commented] (BEAM-702) Simple pattern for per-bundle and per-DoFn Closeable resources

2016-10-04 Thread Eugene Kirpichov (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15546300#comment-15546300
 ] 

Eugene Kirpichov commented on BEAM-702:
---

Hmm, I didn't realize that we don't call finishBundle and teardown when the 
bundle fails. But yes, this makes a lot of sense.

I'm not sure whether I personally would prefer annotation-based closeables or 
explicit calls (e.g. c.addCloseable(createDBWriter)). Explicit calls would make 
it easier to open a dynamic set of resources (e.g. lazily opening connections to 
different shards of a database depending on the data). However, that could be 
encapsulated into a single Closeable object, making the two styles equivalent.
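
To illustrate the explicit-call style mentioned above: addCloseable() is not an
existing DoFn API, and ShardWriter/shardFor are hypothetical; this is only a
sketch of what registering dynamically opened resources with the runner might
look like, with everything registered being closed when the bundle ends.

import java.util.HashMap;
import java.util.Map;
import org.apache.beam.sdk.transforms.DoFn;

class ShardedWriteFn extends DoFn<String, Void> {
  // One writer per shard, opened lazily as elements arrive.
  private transient Map<String, ShardWriter> writers;

  @StartBundle
  public void startBundle(Context c) {
    writers = new HashMap<>();
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    String shard = shardFor(c.element());  // hypothetical sharding helper
    ShardWriter writer = writers.computeIfAbsent(
        shard,
        // Hypothetical API: registering hands ownership to the runner, which
        // would close everything registered at bundle end, even on failure.
        s -> c.addCloseable(ShardWriter.open(s)));
    writer.write(c.element());
  }
}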

> Simple pattern for per-bundle and per-DoFn Closeable resources
> --
>
> Key: BEAM-702
> URL: https://issues.apache.org/jira/browse/BEAM-702
> Project: Beam
>  Issue Type: Improvement
>Reporter: Eugene Kirpichov
>
> Dealing with Closeable resources inside a processElement call is easy: simply 
> use try-with-resources.
> However, bundle- or DoFn-scoped resources, such as long-lived database 
> connections, are less convenient to deal with: you have to open them in 
> startBundle and conditionally close them in finishBundle (likewise 
> setup/teardown), taking special care, when there are multiple resources, to 
> close all of them.
> Perhaps we should provide something like Guava's Closer 
> (https://github.com/google/guava/wiki/ClosingResourcesExplained) to DoFn. 
> Ideally, the user would only need to write a startBundle() or setup() method, 
> but not finishBundle() or teardown() - resources would be closed automatically.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (BEAM-25) Add user-ready API for interacting with state

2016-10-04 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15546294#comment-15546294
 ] 

ASF GitHub Bot commented on BEAM-25:


GitHub user kennknowles opened a pull request:

https://github.com/apache/incubator-beam/pull/1044

[BEAM-25] Refactor StateSpec out of StateTag

Be sure to do all of the following to help us incorporate your contribution
quickly and easily:

 - [x] Make sure the PR title is formatted like:
   `[BEAM-] Description of pull request`
 - [ ] Make sure tests pass via `mvn clean verify`. (Even better, enable
   Travis-CI on your fork and ensure the whole test matrix passes).
 - [x] Replace `` in the title with the actual Jira issue
   number, if there is one.
 - [x] If this contribution is large, please file an Apache
   [Individual Contributor License 
Agreement](https://www.apache.org/licenses/icla.txt).

---

R: @aljoscha AND @tgroh AND @bjchambers 

This is a rebase of #793 without much cleanup. The same commentary applies 
- I'm not totally happy with it but I want to ask for feedback.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/kennknowles/incubator-beam StateSpec

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-beam/pull/1044.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1044


commit 1f2de1e1212cbdb5eb3c8cb7caf30b94da5e0b00
Author: Kenneth Knowles 
Date:   2016-08-05T03:50:28Z

Create StateSpec parallel to StateTag

commit 6fa680858719eaf13edcc08aa3f3260584876fb5
Author: Kenneth Knowles 
Date:   2016-08-05T04:48:48Z

Make StateTag carry a StateSpec separately from its id




> Add user-ready API for interacting with state
> -
>
> Key: BEAM-25
> URL: https://issues.apache.org/jira/browse/BEAM-25
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-java-core
>Reporter: Kenneth Knowles
>Assignee: Kenneth Knowles
>  Labels: State
>
> Our current state API is targeted at runner implementers, not pipeline 
> authors. As such it has many capabilities that are not necessary nor 
> desirable for simple use cases of stateful ParDo (such as dynamic state tag 
> creation). Implement a simple state intended for user access.
> (Details of our current thoughts in forthcoming design doc)
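
For readers following BEAM-25, a rough illustration of the kind of user-facing
declaration this refactor appears to be working toward (imports omitted). The
names StateSpec, StateSpecs.value, @StateId, and ValueState are assumptions
drawn from the PR description and discussion, not necessarily the API
introduced by this PR.

class CountFn extends DoFn<KV<String, Long>, KV<String, Long>> {
  // Declarative: the spec describes the state cell; the runner supplies the
  // actual ValueState at processing time, matched by the @StateId name.
  @StateId("count")
  private final StateSpec<ValueState<Long>> countSpec = StateSpecs.value(VarLongCoder.of());

  @ProcessElement
  public void processElement(ProcessContext c, @StateId("count") ValueState<Long> count) {
    Long stored = count.read();
    long updated = (stored == null ? 0L : stored) + c.element().getValue();
    count.write(updated);
    c.output(KV.of(c.element().getKey(), updated));
  }
}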



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] incubator-beam pull request #1044: [BEAM-25] Refactor StateSpec out of State...

2016-10-04 Thread kennknowles
GitHub user kennknowles opened a pull request:

https://github.com/apache/incubator-beam/pull/1044

[BEAM-25] Refactor StateSpec out of StateTag

Be sure to do all of the following to help us incorporate your contribution
quickly and easily:

 - [x] Make sure the PR title is formatted like:
   `[BEAM-] Description of pull request`
 - [ ] Make sure tests pass via `mvn clean verify`. (Even better, enable
   Travis-CI on your fork and ensure the whole test matrix passes).
 - [x] Replace `` in the title with the actual Jira issue
   number, if there is one.
 - [x] If this contribution is large, please file an Apache
   [Individual Contributor License 
Agreement](https://www.apache.org/licenses/icla.txt).

---

R: @aljoscha AND @tgroh AND @bjchambers 

This is a rebase of #793 without much cleanup. The same commentary applies 
- I'm not totally happy with it but I want to ask for feedback.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/kennknowles/incubator-beam StateSpec

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-beam/pull/1044.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1044


commit 1f2de1e1212cbdb5eb3c8cb7caf30b94da5e0b00
Author: Kenneth Knowles 
Date:   2016-08-05T03:50:28Z

Create StateSpec parallel to StateTag

commit 6fa680858719eaf13edcc08aa3f3260584876fb5
Author: Kenneth Knowles 
Date:   2016-08-05T04:48:48Z

Make StateTag carry a StateSpec separately from its id




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Comment Edited] (BEAM-702) Simple pattern for per-bundle and per-DoFn Closeable resources

2016-10-04 Thread Pei He (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15546277#comment-15546277
 ] 

Pei He edited comment on BEAM-702 at 10/4/16 6:47 PM:
--

Is the feature request for something like this?
class MyDoFn extends DoFn {
  // Hypothetical annotation: marks a Closeable field that the runner should
  // close automatically when the given scope (here, the bundle) ends.
  @Closeable(Scope.BUNDLE)
  private DBWriter writer = null;

  @StartBundle
  public void startBundle(Context c) {
    writer = createDBWriter();
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    writer.write(...);
  }
}

I think runners can recognize Closeable resources and their scopes through 
annotations, and they can generate the code to execute close() after the 
resources go out of scope.

But I think the real usefulness is more about "closed automatically" in both 
success and failure conditions.
Closing the resources manually in finishBundle() or teardown() works if the 
pipeline doesn't fail; if the pipeline does fail, the resources might be left 
open.

This is the problem we have in BigQueryIO.Read, where we start external BQ 
extract jobs: if the pipeline fails, there is no hook to cancel them.

In summary, my 2 cents:
+1 for introducing a mechanism to recognize Closeable resources in the model, 
and it needs to work in both success and failure conditions.


was (Author: pei...@gmail.com):
The feature request is for something like this?
class MyDoFn extends DoFn {
  @Closable(Scope.BUNDLE)
  private DBWriter writer = null;

  @StartBundle
  public void startBundle(Context c) {
writer = createDBWriter();
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
writer.write(...)
  }
}

I think runners can recognize Closable resources and their scopes through 
annotations, and they can generate the code the execute close() after the 
resources become out of the scope.

But, I think the real usefulness is more about "closed automatically" in both 
success and failure conditions.
Closing the resources manually in finishBundle() or teardown() work if the 
pipeline doesn't failure. Otherwise, if the pipeline fails, the resources might 
be left open.

This is the problem we have in BigQueryIO.Read, where we start external BQ 
extract jobs. However, if the pipeline fails, there is no hook to cancel them.

In summary for my 2 cents:
+1 for introducing mechanisms to recognize Closable resources in the model
And, it needs to work in both success and failure conditions.

> Simple pattern for per-bundle and per-DoFn Closeable resources
> --
>
> Key: BEAM-702
> URL: https://issues.apache.org/jira/browse/BEAM-702
> Project: Beam
>  Issue Type: Improvement
>Reporter: Eugene Kirpichov
>
> Dealing with Closeable resources inside a processElement call is easy: simply 
> use try-with-resources.
> However, bundle- or DoFn-scoped resources, such as long-lived database 
> connections, are less convenient to deal with: you have to open them in 
> startBundle and conditionally close them in finishBundle (likewise 
> setup/teardown), taking special care, when there are multiple resources, to 
> close all of them.
> Perhaps we should provide something like Guava's Closer 
> (https://github.com/google/guava/wiki/ClosingResourcesExplained) to DoFn. 
> Ideally, the user would only need to write a startBundle() or setup() method, 
> but not finishBundle() or teardown() - resources would be closed automatically.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (BEAM-702) Simple pattern for per-bundle and per-DoFn Closeable resources

2016-10-04 Thread Pei He (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15546277#comment-15546277
 ] 

Pei He commented on BEAM-702:
-

Is the feature request for something like this?
class MyDoFn extends DoFn {
  // Hypothetical annotation: marks a Closeable field that the runner should
  // close automatically when the given scope (here, the bundle) ends.
  @Closeable(Scope.BUNDLE)
  private DBWriter writer = null;

  @StartBundle
  public void startBundle(Context c) {
    writer = createDBWriter();
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    writer.write(...);
  }
}

I think runners can recognize Closeable resources and their scopes through 
annotations, and they can generate the code to execute close() after the 
resources go out of scope.

But I think the real usefulness is more about "closed automatically" in both 
success and failure conditions.
Closing the resources manually in finishBundle() or teardown() works if the 
pipeline doesn't fail; if the pipeline does fail, the resources might be left 
open.

This is the problem we have in BigQueryIO.Read, where we start external BQ 
extract jobs: if the pipeline fails, there is no hook to cancel them.

In summary, my 2 cents:
+1 for introducing a mechanism to recognize Closeable resources in the model, 
and it needs to work in both success and failure conditions.
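
As a baseline for the discussion, a minimal sketch of the "something like
Guava's Closer" idea from the issue description, done manually today in
@Setup/@Teardown. DatabaseClient and openDatabase() are hypothetical stand-ins
for a Closeable resource, and this still only covers cleanup that the runner
actually triggers, which is exactly the success/failure point made above.

import com.google.common.io.Closer;
import java.io.IOException;
import org.apache.beam.sdk.transforms.DoFn;

class ClosingFn extends DoFn<String, Void> {
  private transient Closer closer;
  private transient DatabaseClient db;  // hypothetical Closeable resource

  @Setup
  public void setup() {
    closer = Closer.create();
    // Every resource opened here is registered once; teardown only closes the Closer.
    db = closer.register(openDatabase());  // openDatabase() is a hypothetical helper
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    db.write(c.element());
  }

  @Teardown
  public void teardown() throws IOException {
    closer.close();  // closes all registered resources, even if one close() throws
  }
}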

> Simple pattern for per-bundle and per-DoFn Closeable resources
> --
>
> Key: BEAM-702
> URL: https://issues.apache.org/jira/browse/BEAM-702
> Project: Beam
>  Issue Type: Improvement
>Reporter: Eugene Kirpichov
>
> Dealing with Closeable resources inside a processElement call is easy: simply 
> use try-with-resources.
> However, bundle- or DoFn-scoped resources, such as long-lived database 
> connections, are less convenient to deal with: you have to open them in 
> startBundle and conditionally close them in finishBundle (likewise 
> setup/teardown), taking special care, when there are multiple resources, to 
> close all of them.
> Perhaps we should provide something like Guava's Closer 
> (https://github.com/google/guava/wiki/ClosingResourcesExplained) to DoFn. 
> Ideally, the user would only need to write a startBundle() or setup() method, 
> but not finishBundle() or teardown() - resources would be closed automatically.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (BEAM-696) Side-Inputs non-deterministic with merging main-input windows

2016-10-04 Thread Kenneth Knowles (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15546263#comment-15546263
 ] 

Kenneth Knowles commented on BEAM-696:
--

The SDK needs to lay out a spec, which I think is what Pei is saying, and the 
runner comes up with the execution plan. To clarify - are you suggesting that 
{{CombineFn}} should only be allowed side input access in {{extractOutput}}, or 
are you suggesting that runners be required to wait until {{extractOutput}} 
_will_ be called before running a sequence of {{addInput}}* {{mergeAccum}}* 
{{extractOutput}} that accesses side inputs?

The latter sounds like it could be loosened to "give a consistent view of a side 
input to the sequence of {{addInput}}* {{mergeAccum}}* {{extractOutput}}", and 
your proposed execution plan is one obvious choice of how to achieve it.
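
To make the "buffer the inputs and combine later" plan concrete, a sketch (not
an actual Beam mechanism) of a CombineFn whose accumulator is just the buffered
inputs, so nothing that needs the side input runs before extractOutput. How the
side-input value reaches extractOutput is exactly the open question here, so it
is modeled as a field the runner would have to populate.

import java.util.ArrayList;
import java.util.List;
import org.apache.beam.sdk.transforms.Combine;

class BufferingCombineFn extends Combine.CombineFn<Long, List<Long>, Long> {
  // Assumption: the runner would set this from the side input, using the
  // main-input window as merged at extraction time.
  transient long scaleFromSideInput = 1L;

  @Override
  public List<Long> createAccumulator() {
    return new ArrayList<>();
  }

  @Override
  public List<Long> addInput(List<Long> accum, Long input) {
    accum.add(input);  // no pre-combine, so no side-input access happens here
    return accum;
  }

  @Override
  public List<Long> mergeAccumulators(Iterable<List<Long>> accums) {
    List<Long> merged = new ArrayList<>();
    for (List<Long> a : accums) {
      merged.addAll(a);
    }
    return merged;
  }

  @Override
  public Long extractOutput(List<Long> accum) {
    long sum = 0;
    for (long v : accum) {
      sum += v;
    }
    return sum * scaleFromSideInput;  // the only side-input-dependent step
  }
}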

> Side-Inputs non-deterministic with merging main-input windows
> -
>
> Key: BEAM-696
> URL: https://issues.apache.org/jira/browse/BEAM-696
> Project: Beam
>  Issue Type: Bug
>  Components: beam-model
>Reporter: Ben Chambers
>Assignee: Pei He
>
> Side-Inputs are non-deterministic for several reasons:
> 1. Because they depend on triggering of the side-input (this is acceptable 
> because triggers are by their nature non-deterministic).
> 2. They depend on the current state of the main-input window in order to 
> look up the side-input. This means that with merging windows, the result 
> depends on which windows have merged at the time of the lookup.
> 3. Any runner optimizations that affect when the side-input is looked up may 
> cause problems with either or both of these.
> This issue focuses on #2 -- the non-determinism of side-inputs that execute 
> within a merging WindowFn.
> A possible solution would be to defer running anything that looks up the 
> side-input until we need to extract an output, and to use the main window at 
> that point. Specifically, if the main input uses a merging WindowFn, don't 
> execute any kind of pre-combine; instead, buffer all the inputs and combine 
> later.
> This could still run into some non-determinism if there are triggers 
> controlling when we extract output.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (BEAM-703) SingletonViewFn might exhaust defaultValue if it's serialized after being used.

2016-10-04 Thread Kenneth Knowles (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15546196#comment-15546196
 ] 

Kenneth Knowles commented on BEAM-703:
--

I think that might have been an attempt to save memory by freeing up the 
encoded bytes. I agree with you that it is a bug.

> SingletonViewFn might exhaust defaultValue if it's serialized after being 
> used. 
> 
>
> Key: BEAM-703
> URL: https://issues.apache.org/jira/browse/BEAM-703
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Reporter: Amit Sela
>Assignee: Amit Sela
>Priority: Minor
>
> In 
> https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/util/PCollectionViews.java#L267
>  the defaultValue is set to null to avoid decoding over and over, I assume.
> If the defaultValue is accessed before the SingletonViewFn is serialized, the 
> encoded value is consumed (set to null after decoding), and serialization then 
> loses the transient decoded value.
> It'd probably be best to simply check whether defaultValue is null before 
> decoding, so that decoding still happens just once but the encoded data is 
> not lost.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] incubator-beam pull request #1043: Datastore Connector prototype using googl...

2016-10-04 Thread vikkyrk
GitHub user vikkyrk opened a pull request:

https://github.com/apache/incubator-beam/pull/1043

Datastore Connector prototype using google-cloud python client

Be sure to do all of the following to help us incorporate your contribution
quickly and easily:

 - [ ] Make sure the PR title is formatted like:
   `[BEAM-] Description of pull request`
 - [ ] Make sure tests pass via `mvn clean verify`. (Even better, enable
   Travis-CI on your fork and ensure the whole test matrix passes).
 - [ ] Replace `` in the title with the actual Jira issue
   number, if there is one.
 - [ ] If this contribution is large, please file an Apache
   [Individual Contributor License 
Agreement](https://www.apache.org/licenses/icla.txt).

---


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/vikkyrk/incubator-beam py_ds

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-beam/pull/1043.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1043


commit d553dd23ef346e74ab1eb48d6e29135bb8388b94
Author: Vikas Kedigehalli 
Date:   2016-09-29T18:07:11Z

DatastoreIO

commit ea430d914d06ccac6b778a545c8da0cf0a2c1753
Author: Vikas Kedigehalli 
Date:   2016-10-04T17:19:25Z

Datastore write




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-beam pull request #1042: BigQueryIO: port trivial fixes from Dataf...

2016-10-04 Thread peihe
GitHub user peihe opened a pull request:

https://github.com/apache/incubator-beam/pull/1042

BigQueryIO: port trivial fixes from Dataflow version.

Will rebase once PR-1039 is merged

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/peihe/incubator-beam 
bq-beam-dataflow-consistency

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-beam/pull/1042.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1042


commit 95cbdea9f55395fa60f5795d88e1178804e22865
Author: Pei He 
Date:   2016-10-04T02:37:02Z

Forward port Dataflow PR-431 to Beam

commit e42a4df4a414284a201a150bc7754570740ed644
Author: Pei He 
Date:   2016-10-04T03:39:32Z

Forward port Dataflow PR-454 to Beam

commit 3659e99e46ee0420347de84ae6661c0fad410d94
Author: Pei He 
Date:   2016-10-04T04:05:44Z

Forward port Dataflow PR-453 to Beam

commit 6c69d05352427586817af51e6e5221b9ac1d7890
Author: Pei He 
Date:   2016-10-04T04:19:37Z

BigQueryIO: port trivial fixes from Dataflow version.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-beam pull request #1041: Add a default bucket to Dataflow runner

2016-10-04 Thread sammcveety
GitHub user sammcveety opened a pull request:

https://github.com/apache/incubator-beam/pull/1041

Add a default bucket to Dataflow runner

There is a dependency failure that I'm not sure how to debug; all other 
tests pass.

@tgroh could you please take a look?
---



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sammcveety/incubator-beam sgmc/bucket_staging

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-beam/pull/1041.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1041


commit 28e935efb72b209da48981f55e4349f37597f34a
Author: sammcveety 
Date:   2016-10-03T18:21:41Z

Add a default bucket to Dataflow runner.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Updated] (BEAM-703) SingletonViewFn might exhaust defaultValue if it's serialized after being used.

2016-10-04 Thread Amit Sela (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amit Sela updated BEAM-703:
---
Summary: SingletonViewFn might exhaust defaultValue if it's serialized 
after being used.   (was: SingletonViewFn might exhaust defaultValue if it's 
serialized after used. )

> SingletonViewFn might exhaust defaultValue if it's serialized after being 
> used. 
> 
>
> Key: BEAM-703
> URL: https://issues.apache.org/jira/browse/BEAM-703
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Reporter: Amit Sela
>Assignee: Amit Sela
>Priority: Minor
>
> In 
> https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/util/PCollectionViews.java#L267
>  the defaultValue is set to null to avoid decoding over and over, I assume.
> If the defaultValue is accessed before the SingletonViewFn is serialized, the 
> encoded value is consumed (set to null after decoding), and serialization then 
> loses the transient decoded value.
> It'd probably be best to simply check whether defaultValue is null before 
> decoding, so that decoding still happens just once but the encoded data is 
> not lost.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (BEAM-703) SingletonViewFn might exhaust defaultValue if it's serialized after being used.

2016-10-04 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15545665#comment-15545665
 ] 

ASF GitHub Bot commented on BEAM-703:
-

GitHub user amitsela opened a pull request:

https://github.com/apache/incubator-beam/pull/1040

[BEAM-703] SingletonViewFn might exhaust defaultValue if it's serialized 
after being used.

Be sure to do all of the following to help us incorporate your contribution
quickly and easily:

 - [ ] Make sure the PR title is formatted like:
   `[BEAM-] Description of pull request`
 - [ ] Make sure tests pass via `mvn clean verify`. (Even better, enable
   Travis-CI on your fork and ensure the whole test matrix passes).
 - [ ] Replace `` in the title with the actual Jira issue
   number, if there is one.
 - [ ] If this contribution is large, please file an Apache
   [Individual Contributor License 
Agreement](https://www.apache.org/licenses/icla.txt).

---



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/amitsela/incubator-beam BEAM-703

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-beam/pull/1040.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1040


commit 0bd97efc0d764af17cdd8abdf43bff33bb21be2b
Author: Sela 
Date:   2016-10-04T15:12:38Z

Avoid losing the encoded defaultValue.




> SingletonViewFn might exhaust defaultValue if it's serialized after being 
> used. 
> 
>
> Key: BEAM-703
> URL: https://issues.apache.org/jira/browse/BEAM-703
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Reporter: Amit Sela
>Assignee: Amit Sela
>Priority: Minor
>
> In 
> https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/util/PCollectionViews.java#L267
>  the defaultValue is set to null to avoid decoding over and over, I assume.
> If the defaultValue is accessed before the SingletonViewFn is serialized, the 
> encoded value is consumed (set to null after decoding), and serialization then 
> loses the transient decoded value.
> It'd probably be best to simply check whether defaultValue is null before 
> decoding, so that decoding still happens just once but the encoded data is 
> not lost.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] incubator-beam pull request #1040: [BEAM-703] SingletonViewFn might exhaust ...

2016-10-04 Thread amitsela
GitHub user amitsela opened a pull request:

https://github.com/apache/incubator-beam/pull/1040

[BEAM-703] SingletonViewFn might exhaust defaultValue if it's serialized 
after being used.

Be sure to do all of the following to help us incorporate your contribution
quickly and easily:

 - [ ] Make sure the PR title is formatted like:
   `[BEAM-] Description of pull request`
 - [ ] Make sure tests pass via `mvn clean verify`. (Even better, enable
   Travis-CI on your fork and ensure the whole test matrix passes).
 - [ ] Replace `` in the title with the actual Jira issue
   number, if there is one.
 - [ ] If this contribution is large, please file an Apache
   [Individual Contributor License 
Agreement](https://www.apache.org/licenses/icla.txt).

---



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/amitsela/incubator-beam BEAM-703

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-beam/pull/1040.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1040


commit 0bd97efc0d764af17cdd8abdf43bff33bb21be2b
Author: Sela 
Date:   2016-10-04T15:12:38Z

Avoid losing the encoded defaultValue.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Created] (BEAM-703) SingletonViewFn might exhaust defaultValue if it's serialized after used.

2016-10-04 Thread Amit Sela (JIRA)
Amit Sela created BEAM-703:
--

 Summary: SingletonViewFn might exhaust defaultValue if it's 
serialized after used. 
 Key: BEAM-703
 URL: https://issues.apache.org/jira/browse/BEAM-703
 Project: Beam
  Issue Type: Bug
  Components: sdk-java-core
Reporter: Amit Sela
Assignee: Amit Sela
Priority: Minor


In 
https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/util/PCollectionViews.java#L267
 the defaultValue is set to null to avoid decoding over and over, I assume.
If the defaultValue is accessed before the SingletonViewFn is serialized, the 
encoded value is consumed (set to null after decoding), and serialization then 
loses the transient decoded value.
It'd probably be best to simply check whether defaultValue is null before 
decoding, so that decoding still happens just once but the encoded data is not lost.
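
A stripped-down illustration of the failure mode, independent of Beam: the
encoded bytes are the only field that survives Java serialization, so nulling
them after the first decode means a copy serialized later has neither
representation. The class and field names below are purely illustrative.

import java.io.Serializable;
import java.nio.charset.StandardCharsets;

class LazyDefault implements Serializable {
  private byte[] encoded;            // the serialized form
  private transient String decoded;  // lost on serialization

  LazyDefault(byte[] encoded) {
    this.encoded = encoded;
  }

  String get() {
    if (decoded == null && encoded != null) {
      decoded = new String(encoded, StandardCharsets.UTF_8);
      encoded = null;  // the problematic "free the bytes" step
    }
    return decoded;
  }
}

After get() is called once and the object is then serialized and deserialized,
both fields are null and the value is gone for good; keeping the encoded bytes
and only checking whether the decoded value is still null, as the committed fix
in this thread does, avoids that.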



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (BEAM-696) Side-Inputs non-deterministic with merging main-input windows

2016-10-04 Thread Amit Sela (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15544926#comment-15544926
 ] 

Amit Sela commented on BEAM-696:


[~pei...@gmail.com] shouldn't the SDK take care of / enforce this behaviour 
rather than let each runner decide? It sounds like, for merging windows, 
CombineFns should look up sideInputs only at extractOutput; otherwise I don't 
see how DataflowRunner is deferring - a local pre-combine cannot guarantee the 
state of the merging window (in a specific bundle), correct?

I think this issue is about the SDK / Runner API and should be resolved by 
either limiting the use of sideInputs with Sessions or enforcing/instructing 
runners to defer looking up the sideInput until extractOutput executes.

Am I missing something here?

> Side-Inputs non-deterministic with merging main-input windows
> -
>
> Key: BEAM-696
> URL: https://issues.apache.org/jira/browse/BEAM-696
> Project: Beam
>  Issue Type: Bug
>  Components: beam-model
>Reporter: Ben Chambers
>Assignee: Pei He
>
> Side-Inputs are non-deterministic for several reasons:
> 1. Because they depend on triggering of the side-input (this is acceptable 
> because triggers are by their nature non-deterministic).
> 2. They depend on the current state of the main-input window in order to 
> look up the side-input. This means that with merging windows, the result 
> depends on which windows have merged at the time of the lookup.
> 3. Any runner optimizations that affect when the side-input is looked up may 
> cause problems with either or both of these.
> This issue focuses on #2 -- the non-determinism of side-inputs that execute 
> within a merging WindowFn.
> A possible solution would be to defer running anything that looks up the 
> side-input until we need to extract an output, and to use the main window at 
> that point. Specifically, if the main input uses a merging WindowFn, don't 
> execute any kind of pre-combine; instead, buffer all the inputs and combine 
> later.
> This could still run into some non-determinism if there are triggers 
> controlling when we extract output.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)