[jira] [Updated] (BEAM-1188) More Verifiers For Python E2E Tests

2016-12-20 Thread Mark Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-1188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Liu updated BEAM-1188:
---
Description: 
Add more basic verifiers to the e2e tests to verify output data in different 
storage/filesystems:

1. File verifier: compute and verify the checksum of file(s) stored on a 
filesystem (GCS / local fs). 
2. BigQuery verifier: query a BigQuery table and verify the response content. 

Also update TestOptions.on_success_matcher to accept a list of matchers 
instead of a single one.

Note: add retries when doing IO to avoid test flakiness that may come from 
filesystem inconsistency. This problem happened in the Java integration 
tests.
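As a sketch of what such a file verifier with retries could look like (the helper names here are hypothetical, not the actual Beam matcher API; the checksum is assumed to be SHA-1 over sorted output lines so shard order doesn't matter):

```python
import hashlib
import time


def compute_checksum(lines):
    """Compute a SHA-1 checksum over sorted output lines, so the result
    is independent of shard ordering."""
    sha1 = hashlib.sha1()
    for line in sorted(lines):
        sha1.update(line.encode('utf-8'))
    return sha1.hexdigest()


def read_output_with_retry(read_fn, attempts=3, delay_secs=0.1):
    """Retry an IO operation so transient filesystem inconsistency
    (e.g. a GCS listing that briefly returns nothing) doesn't fail the test."""
    last_error = None
    for _ in range(attempts):
        try:
            lines = read_fn()
            if lines:
                return lines
        except IOError as e:
            last_error = e
        time.sleep(delay_secs)
    raise last_error or IOError('no output found after %d attempts' % attempts)
```

A real verifier would combine the two: retry the read, then compare the computed checksum against the expected one.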




> More Verifiers For Python E2E Tests
> ---
>
> Key: BEAM-1188
> URL: https://issues.apache.org/jira/browse/BEAM-1188
> Project: Beam
>  Issue Type: Task
>  Components: sdk-py, testing
>Reporter: Mark Liu
>Assignee: Mark Liu
>
> Add more basic verifiers to the e2e tests to verify output data in different 
> storage/filesystems:
> 1. File verifier: compute and verify the checksum of file(s) stored on a 
> filesystem (GCS / local fs). 
> 2. BigQuery verifier: query a BigQuery table and verify the response content. 
> Also update TestOptions.on_success_matcher to accept a list of matchers 
> instead of a single one.
> Note: add retries when doing IO to avoid test flakiness that may come from 
> filesystem inconsistency. This problem happened in the Java integration 
> tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (BEAM-1188) More Verifiers For Python E2E Tests

2016-12-20 Thread Mark Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-1188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Liu updated BEAM-1188:
---
Description: 
Add more basic verifiers to the e2e tests to verify output data in different 
storage/filesystems:

1. File verifier: compute and verify the checksum of file(s) stored on a 
filesystem (GCS / local fs). 
2. BigQuery verifier: query a BigQuery table and verify the response content. 
...

Also update TestOptions.on_success_matcher to accept a list of matchers 
instead of a single one.

Note: add retries when doing IO to avoid test flakiness that may come from 
filesystem inconsistency. This problem happened in the Java integration 
tests.




  was:
Add more basic verifiers to the e2e tests to verify output data in different 
storage/filesystems:

1. File verifier: compute and verify the checksum of file(s) stored on a 
filesystem (GCS / local fs). 
2. BigQuery verifier: query a BigQuery table and verify the response content. 

Also update TestOptions.on_success_matcher to accept a list of matchers 
instead of a single one.

Note: add retries when doing IO to avoid test flakiness that may come from 
filesystem inconsistency. This problem happened in the Java integration 
tests.





> More Verifiers For Python E2E Tests
> ---
>
> Key: BEAM-1188
> URL: https://issues.apache.org/jira/browse/BEAM-1188
> Project: Beam
>  Issue Type: Task
>  Components: sdk-py, testing
>Reporter: Mark Liu
>Assignee: Mark Liu
>
> Add more basic verifiers to the e2e tests to verify output data in different 
> storage/filesystems:
> 1. File verifier: compute and verify the checksum of file(s) stored on a 
> filesystem (GCS / local fs). 
> 2. BigQuery verifier: query a BigQuery table and verify the response content. 
> ...
> Also update TestOptions.on_success_matcher to accept a list of matchers 
> instead of a single one.
> Note: add retries when doing IO to avoid test flakiness that may come from 
> filesystem inconsistency. This problem happened in the Java integration 
> tests.





[jira] [Created] (BEAM-1188) More Verifiers For Python E2E Tests

2016-12-20 Thread Mark Liu (JIRA)
Mark Liu created BEAM-1188:
--

 Summary: More Verifiers For Python E2E Tests
 Key: BEAM-1188
 URL: https://issues.apache.org/jira/browse/BEAM-1188
 Project: Beam
  Issue Type: Task
  Components: sdk-py, testing
Reporter: Mark Liu
Assignee: Mark Liu








[jira] [Updated] (BEAM-1091) Python E2E WordCount test in Precommit

2016-12-12 Thread Mark Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-1091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Liu updated BEAM-1091:
---
Summary: Python E2E WordCount test in Precommit  (was: E2E WordCount test 
in Precommit)

> Python E2E WordCount test in Precommit
> --
>
> Key: BEAM-1091
> URL: https://issues.apache.org/jira/browse/BEAM-1091
> Project: Beam
>  Issue Type: Task
>  Components: sdk-py, testing
>Reporter: Mark Liu
>Assignee: Mark Liu
>
> We want to include some e2e tests in precommit in order to catch bugs 
> earlier, instead of breaking postcommit so often.
> As in postcommit, we want the same wordcount test in precommit, executed by 
> DirectPipelineRunner and DataflowRunner. 





[jira] [Updated] (BEAM-1124) Python ValidateRunner Test test_multi_valued_singleton_side_input Break Postcommit

2016-12-09 Thread Mark Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Liu updated BEAM-1124:
---
Description: 
The Python test_multi_valued_singleton_side_input test, a ValidatesRunner test 
running on the Dataflow service, failed and broke postcommit 
(https://builds.apache.org/view/Beam/job/beam_PostCommit_Python_Verify/853/).

Here is the stack trace:
{code}
Traceback (most recent call last):
  File 
"/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/sdks/python/apache_beam/dataflow_test.py",
 line 186, in test_multi_valued_singleton_side_input
pipeline.run()
  File 
"/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/sdks/python/apache_beam/pipeline.py",
 line 159, in run
return self.runner.run(self)
  File 
"/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/sdks/python/apache_beam/runners/dataflow_runner.py",
 line 195, in run
% getattr(self, 'last_error_msg', None), self.result)
DataflowRuntimeException: Dataflow pipeline failed:
(99aeafa7a8dffcc7): Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", 
line 514, in do_work
work_executor.execute()
  File "dataflow_worker/executor.py", line 892, in 
dataflow_worker.executor.MapTaskExecutor.execute 
(dataflow_worker/executor.c:24008)
op.start()
  File "dataflow_worker/executor.py", line 456, in 
dataflow_worker.executor.DoOperation.start (dataflow_worker/executor.c:13870)
def start(self):
  File "dataflow_worker/executor.py", line 483, in 
dataflow_worker.executor.DoOperation.start (dataflow_worker/executor.c:13685)
self.dofn_runner = common.DoFnRunner(
  File "apache_beam/runners/common.py", line 89, in 
apache_beam.runners.common.DoFnRunner.__init__ 
(apache_beam/runners/common.c:3469)
args, kwargs, [side_input[global_window]
  File 
"/usr/local/lib/python2.7/dist-packages/apache_beam/transforms/sideinputs.py", 
line 192, in __getitem__
_FilteringIterable(self._iterable, target_window), self._view_options)
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/pvalue.py", line 
279, in _from_runtime_iterable
'PCollection with more than one element accessed as '
ValueError: PCollection with more than one element accessed as a singleton view.
{code}

Worker logs are here:
https://builds.apache.org/view/Beam/job/beam_PostCommit_Python_Verify/853/console

To temporarily ignore this test in postcommit, we can comment out the 
"@attr('ValidatesRunner')" annotation on this test. Then it will only run as a 
unit test (executed by DirectRunner), not as a ValidatesRunner test.
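The workaround can be illustrated with a minimal stand-in for nose's attr decorator (the stand-in below only mimics the assumed behavior of nose.plugins.attrib.attr, which the `-a ValidatesRunner` selector filters on):

```python
def attr(*tags):
    """Minimal stand-in for nose.plugins.attrib.attr: records tags that
    the test selector (-a ValidatesRunner) filters on."""
    def mark(fn):
        fn.tags = set(tags)
        return fn
    return mark


# With the annotation present, the test is picked up by the
# ValidatesRunner suite (and still runs as a unit test):
@attr('ValidatesRunner')
def test_multi_valued_singleton_side_input():
    pass


# Commenting out the annotation leaves a plain unit test, so the
# ValidatesRunner suite skips it:
# @attr('ValidatesRunner')
def test_unit_only():
    pass
```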

  was:
The Python test_multi_valued_singleton_side_input test, a ValidatesRunner test 
running on the Dataflow service, failed and broke postcommit.

Here is the stack trace:
{code}
Traceback (most recent call last):
  File 
"/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/sdks/python/apache_beam/dataflow_test.py",
 line 186, in test_multi_valued_singleton_side_input
pipeline.run()
  File 
"/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/sdks/python/apache_beam/pipeline.py",
 line 159, in run
return self.runner.run(self)
  File 
"/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/sdks/python/apache_beam/runners/dataflow_runner.py",
 line 195, in run
% getattr(self, 'last_error_msg', None), self.result)
DataflowRuntimeException: Dataflow pipeline failed:
(99aeafa7a8dffcc7): Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", 
line 514, in do_work
work_executor.execute()
  File "dataflow_worker/executor.py", line 892, in 
dataflow_worker.executor.MapTaskExecutor.execute 
(dataflow_worker/executor.c:24008)
op.start()
  File "dataflow_worker/executor.py", line 456, in 
dataflow_worker.executor.DoOperation.start (dataflow_worker/executor.c:13870)
def start(self):
  File "dataflow_worker/executor.py", line 483, in 
dataflow_worker.executor.DoOperation.start (dataflow_worker/executor.c:13685)
self.dofn_runner = common.DoFnRunner(
  File "apache_beam/runners/common.py", line 89, in 
apache_beam.runners.common.DoFnRunner.__init__ 
(apache_beam/runners/common.c:3469)
args, kwargs, [side_input[global_window]
  File 
"/usr/local/lib/python2.7/dist-packages/apache_beam/transforms/sideinputs.py", 
line 192, in __getitem__
_FilteringIterable(self._iterable, target_window), self._view_options)
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/pvalue.py", line 
279, in _from_runtime_iterable
'PCollection with more than one element accessed as '
ValueError: PCollection with more than one element accessed as a singleton view.
{code}

Worker logs are here:
https://builds.apache.org/view/Beam/job/beam_PostCommit_Python_Verify/853/console

In order to temporarily ignore this test in postcommit, 

[jira] [Created] (BEAM-1112) Python E2E Integration Test Framework - Batch Only

2016-12-08 Thread Mark Liu (JIRA)
Mark Liu created BEAM-1112:
--

 Summary: Python E2E Integration Test Framework - Batch Only
 Key: BEAM-1112
 URL: https://issues.apache.org/jira/browse/BEAM-1112
 Project: Beam
  Issue Type: Task
  Components: sdk-py, testing
Reporter: Mark Liu
Assignee: Mark Liu


Parity with Java. 

Build an e2e integration test framework that can configure and run a batch 
pipeline with a specified test runner, wait for pipeline execution, and verify 
the results with the given verifiers at the end.
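The flow above can be sketched as a small driver (all names here are hypothetical, not the real framework's API, which hooks into TestPipeline and on_success_matcher):

```python
def run_integration_test(run_pipeline, verifiers):
    """Run a batch pipeline to completion, then check every verifier.

    run_pipeline: callable that executes the pipeline and returns its result.
    verifiers: callables, each returning True if its check passes.
    """
    result = run_pipeline()  # blocks until the batch pipeline finishes
    failures = [v.__name__ for v in verifiers if not v(result)]
    if failures:
        raise AssertionError('verifiers failed: %s' % ', '.join(failures))
    return result
```

The point of the design is that verifiers run only after execution completes, so both file-based and BigQuery-based checks see final output.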





[jira] [Closed] (BEAM-1077) @ValidatesRunner test in Python postcommit

2016-12-08 Thread Mark Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-1077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Liu closed BEAM-1077.
--
   Resolution: Done
Fix Version/s: Not applicable

> @ValidatesRunner test in Python postcommit
> --
>
> Key: BEAM-1077
> URL: https://issues.apache.org/jira/browse/BEAM-1077
> Project: Beam
>  Issue Type: Test
>  Components: sdk-py, testing
>Reporter: Mark Liu
>Assignee: Mark Liu
> Fix For: Not applicable
>
>
> Modify run_postcommit.sh to have @ValidatesRunner tests running on service.





[jira] [Commented] (BEAM-1109) Python ValidatesRunner Tests on Dataflow Service Timeout

2016-12-08 Thread Mark Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15731495#comment-15731495
 ] 

Mark Liu commented on BEAM-1109:


Actually, the current ValidatesRunner tests are not running in parallel, since 
the config is missing in this context. Also, a unique job_name should be 
assigned to each pipeline job, instead of specifying a single name in the test 
execution command.

In order to get postcommit back to green soon, let's remove the parallel 
config first and add it back later once it's fully tested. 
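A hypothetical helper for deriving a unique job_name per pipeline job might look like this (the prefix and format are assumptions for illustration, not Beam's actual naming scheme):

```python
import random
import string
import time


def unique_job_name(prefix='beam-it'):
    """Derive a unique Dataflow job_name per pipeline job, so tests
    launched in parallel don't collide on a shared name."""
    # Timestamp gives ordering; the random suffix disambiguates jobs
    # created within the same millisecond.
    suffix = ''.join(random.choice(string.ascii_lowercase + string.digits)
                     for _ in range(6))
    return '%s-%d-%s' % (prefix, int(time.time() * 1000), suffix)
```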

> Python ValidatesRunner Tests on Dataflow Service Timeout
> 
>
> Key: BEAM-1109
> URL: https://issues.apache.org/jira/browse/BEAM-1109
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py, testing
>Reporter: Mark Liu
>Assignee: Mark Liu
>
> ValidatesRunner tests time out, with the following logs:
> https://builds.apache.org/view/Beam/job/beam_PostCommit_Python_Verify/839/console
> We need to increase "--process-timeout" in the execution command 
> (https://github.com/apache/incubator-beam/blob/python-sdk/sdks/python/run_postcommit.sh#L77).





[jira] [Created] (BEAM-1109) Python ValidatesRunner Tests on Dataflow Service Timeout

2016-12-07 Thread Mark Liu (JIRA)
Mark Liu created BEAM-1109:
--

 Summary: Python ValidatesRunner Tests on Dataflow Service Timeout
 Key: BEAM-1109
 URL: https://issues.apache.org/jira/browse/BEAM-1109
 Project: Beam
  Issue Type: Bug
  Components: sdk-py, testing
Reporter: Mark Liu
Assignee: Mark Liu


ValidatesRunner tests time out, with the following logs:
https://builds.apache.org/view/Beam/job/beam_PostCommit_Python_Verify/839/console

We need to increase "--process-timeout" in the execution command 
(https://github.com/apache/incubator-beam/blob/python-sdk/sdks/python/run_postcommit.sh#L77).







[jira] [Created] (BEAM-1091) E2E WordCount test in Precommit

2016-12-05 Thread Mark Liu (JIRA)
Mark Liu created BEAM-1091:
--

 Summary: E2E WordCount test in Precommit
 Key: BEAM-1091
 URL: https://issues.apache.org/jira/browse/BEAM-1091
 Project: Beam
  Issue Type: Task
  Components: sdk-py, testing
Reporter: Mark Liu
Assignee: Mark Liu


We want to include some e2e tests in precommit in order to catch bugs earlier, 
instead of breaking postcommit so often.

As in postcommit, we want the same wordcount test in precommit, executed by 
DirectPipelineRunner and DataflowRunner. 





[jira] [Assigned] (BEAM-719) Run WindowedWordCount Integration Test in Spark

2016-12-01 Thread Mark Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Liu reassigned BEAM-719:
-

Assignee: Mark Liu  (was: Amit Sela)

> Run WindowedWordCount Integration Test in Spark
> ---
>
> Key: BEAM-719
> URL: https://issues.apache.org/jira/browse/BEAM-719
> Project: Beam
>  Issue Type: Improvement
>  Components: testing
>Reporter: Mark Liu
>Assignee: Mark Liu
>
> The purpose of running WindowedWordCountIT in Spark is to have a streaming 
> test pipeline running in Jenkins pre-commit using TestSparkRunner.
> More discussion happened here:
> https://github.com/apache/incubator-beam/pull/1045#issuecomment-251531770





[jira] [Resolved] (BEAM-605) Create BigQuery Verifier

2016-10-20 Thread Mark Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Liu resolved BEAM-605.
---
   Resolution: Done
Fix Version/s: Not applicable

> Create BigQuery Verifier
> 
>
> Key: BEAM-605
> URL: https://issues.apache.org/jira/browse/BEAM-605
> Project: Beam
>  Issue Type: Task
>  Components: testing
>Affects Versions: Not applicable
>Reporter: Mark Liu
>Assignee: Mark Liu
> Fix For: Not applicable
>
>
> Create a BigQuery verifier that is used to verify the output of integration 
> tests that write their output to BigQuery. 





[jira] [Commented] (BEAM-747) Text checksum verifier is not resilient to eventually consistent filesystems

2016-10-14 Thread Mark Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15576630#comment-15576630
 ] 

Mark Liu commented on BEAM-747:
---

Yes, it's worth having retries for file path matching and reading, in order to 
handle IO failures from the filesystem and some special cases, like no file 
being found.

As for example 2, one place to add the sharding name template is the output 
path argument passed to FileChecksumMatcher. Instead of using ".../result*", 
we can use ".../result*-of-*". This avoids reading irrelevant files, but can't 
guarantee all shards are read unless the total number of shards is given.

My current thought is to pass the number of shards from the command line as an 
optional test option, then pass it to the verifier. Not sure if we have a 
better way to do that, since from previous test results I found that the 
number of shards is runner-dependent.
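A sketch of the shard-template check described above (a hypothetical helper; the real FileChecksumMatcher lives in the Java SDK, and the `-NNNNN-of-MMMMM` suffix is the standard shard name template):

```python
import re


def all_shards_present(filenames, num_shards=None):
    """Check that matched files follow the '-NNNNN-of-MMMMM' shard template
    and, when the shard count is known (from a filename or a test option),
    that no shard is missing."""
    pattern = re.compile(r'-(\d+)-of-(\d+)$')
    shards = {}
    for name in filenames:
        m = pattern.search(name)
        if not m:
            return False  # an irrelevant file slipped through the glob
        idx, total = int(m.group(1)), int(m.group(2))
        shards[idx] = total
    if not shards:
        return False
    # Fall back to the total encoded in the filenames when the runner-dependent
    # shard count wasn't passed in explicitly.
    expected = num_shards if num_shards is not None else next(iter(shards.values()))
    return len(shards) == expected and all(t == expected for t in shards.values())
```

With this check, the "missed shard 1-of-3" case from example 2 would be detected before computing a checksum over incomplete output.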

> Text checksum verifier is not resilient to eventually consistent filesystems
> 
>
> Key: BEAM-747
> URL: https://issues.apache.org/jira/browse/BEAM-747
> Project: Beam
>  Issue Type: Bug
>  Components: testing
>Affects Versions: Not applicable
>Reporter: Daniel Halperin
>Assignee: Mark Liu
>
> Example 1: 
> https://builds.apache.org/job/beam_PreCommit_MavenVerify/3934/org.apache.beam$beam-examples-java/console
> Here it looks like we need to retry listing files, at least a little bit, if 
> none are found. They did show up:
> {code}
> gsutil ls 
> gs://temp-storage-for-end-to-end-tests/WordCountIT-2016-10-13-12-37-02-467/output/results\*
> gs://temp-storage-for-end-to-end-tests/WordCountIT-2016-10-13-12-37-02-467/output/results-0-of-3
> gs://temp-storage-for-end-to-end-tests/WordCountIT-2016-10-13-12-37-02-467/output/results-1-of-3
> gs://temp-storage-for-end-to-end-tests/WordCountIT-2016-10-13-12-37-02-467/output/results-2-of-3
> {code}
> Example 2: 
> https://builds.apache.org/job/beam_PostCommit_MavenVerify/org.apache.beam$beam-examples-java/1525/testReport/junit/org.apache.beam.examples/WordCountIT/testE2EWordCount/
> Here it looks like we need to fill in the shard template if the filesystem 
> does not give us a consistent result:
> {code}
> Oct 14, 2016 12:31:16 AM org.apache.beam.sdk.testing.FileChecksumMatcher 
> readLines
> INFO: [0 of 1] Read 162 lines from file: 
> gs://temp-storage-for-end-to-end-tests/WordCountIT-2016-10-14-00-25-55-609/output/results-0-of-3
> Oct 14, 2016 12:31:16 AM org.apache.beam.sdk.testing.FileChecksumMatcher 
> readLines
> INFO: [1 of 1] Read 144 lines from file: 
> gs://temp-storage-for-end-to-end-tests/WordCountIT-2016-10-14-00-25-55-609/output/results-2-of-3
> Oct 14, 2016 12:31:16 AM org.apache.beam.sdk.testing.FileChecksumMatcher 
> matchesSafely
> INFO: Generated checksum for output data: 
> aec68948b2515e6ea35fd1ed7649c267a10a01e5
> {code}
> We missed shard 1-of-3 and hence got the wrong checksum.





[jira] [Created] (BEAM-720) Running WindowedWordCount Integration Test in Flink

2016-10-05 Thread Mark Liu (JIRA)
Mark Liu created BEAM-720:
-

 Summary: Running WindowedWordCount Integration Test in Flink
 Key: BEAM-720
 URL: https://issues.apache.org/jira/browse/BEAM-720
 Project: Beam
  Issue Type: Improvement
Reporter: Mark Liu
Assignee: Aljoscha Krettek


In order to have coverage of streaming pipeline tests in pre-commit, it's 
important for TestFlinkRunner to be able to run WindowedWordCountIT 
successfully. 

Relevant works in TestDataflowRunner:
https://github.com/apache/incubator-beam/pull/1045





[jira] [Created] (BEAM-719) Running WindowedWordCount Integration Test in Spark

2016-10-05 Thread Mark Liu (JIRA)
Mark Liu created BEAM-719:
-

 Summary: Running WindowedWordCount Integration Test in Spark
 Key: BEAM-719
 URL: https://issues.apache.org/jira/browse/BEAM-719
 Project: Beam
  Issue Type: Improvement
Reporter: Mark Liu
Assignee: Amit Sela


The purpose of running WindowedWordCountIT in Spark is to have a streaming test 
pipeline running in Jenkins pre-commit using TestSparkRunner.

More discussion happened here:
https://github.com/apache/incubator-beam/pull/1045#issuecomment-251531770





[jira] [Closed] (BEAM-495) Generalize FileChecksumMatcher used for all E2E test

2016-09-09 Thread Mark Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Liu closed BEAM-495.
-

> Generalize FileChecksumMatcher used for all E2E test
> 
>
> Key: BEAM-495
> URL: https://issues.apache.org/jira/browse/BEAM-495
> Project: Beam
>  Issue Type: Improvement
>Reporter: Mark Liu
>Assignee: Mark Liu
> Fix For: Not applicable
>
>
> Refactor WordCountOnSuccessMatcher to be more general so that it can be 
> reused by other tests.
> Requirement:
> Given an input file path (glob accepted) and an expected checksum, generate 
> the checksum of the file(s) and verify it against the expected value.





[jira] [Resolved] (BEAM-624) NoClassDefFoundError Failed Dataflow Streaming Job

2016-09-09 Thread Mark Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Liu resolved BEAM-624.
---
   Resolution: Fixed
Fix Version/s: Not applicable

> NoClassDefFoundError Failed Dataflow Streaming Job
> --
>
> Key: BEAM-624
> URL: https://issues.apache.org/jira/browse/BEAM-624
> Project: Beam
>  Issue Type: Bug
>Reporter: Mark Liu
>Assignee: Mark Liu
> Fix For: Not applicable
>
>
> NoClassDefFoundError when running a streaming job on the Dataflow service.
> Full stack trace:
> {code}
> (bb04a33113307d77): Exception: java.lang.NoClassDefFoundError: 
> org/apache/beam/sdk/util/AttemptBoundedExponentialBackOff 
> com.google.cloud.dataflow.worker.runners.worker.WorkerCustomSources$UnboundedReaderIterator.advance(WorkerCustomSources.java:694)
>  
> com.google.cloud.dataflow.worker.util.common.worker.ReadOperation$SynchronizedReaderIterator.advance(ReadOperation.java:371)
>  
> com.google.cloud.dataflow.worker.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:198)
>  
> com.google.cloud.dataflow.worker.util.common.worker.ReadOperation.start(ReadOperation.java:139)
>  
> com.google.cloud.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:71)
>  
> com.google.cloud.dataflow.worker.runners.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:660)
>  
> com.google.cloud.dataflow.worker.runners.worker.StreamingDataflowWorker.access$500(StreamingDataflowWorker.java:89)
>  
> com.google.cloud.dataflow.worker.runners.worker.StreamingDataflowWorker$6.run(StreamingDataflowWorker.java:487)
>  
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> {code}
> Restore the AttemptBounded/AttemptAndTimeBounded backoff classes, which were 
> removed in this commit [1], so that Dataflow streaming jobs pass at HEAD.
> [1] 
> https://github.com/apache/incubator-beam/commit/dbbcbe604e167b306feac2443bec85f2da3c1dd6





[jira] [Closed] (BEAM-624) NoClassDefFoundError Failed Dataflow Streaming Job

2016-09-09 Thread Mark Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Liu closed BEAM-624.
-

> NoClassDefFoundError Failed Dataflow Streaming Job
> --
>
> Key: BEAM-624
> URL: https://issues.apache.org/jira/browse/BEAM-624
> Project: Beam
>  Issue Type: Bug
>Reporter: Mark Liu
>Assignee: Mark Liu
> Fix For: Not applicable
>
>
> NoClassDefFoundError when running a streaming job on the Dataflow service.
> Full stack trace:
> {code}
> (bb04a33113307d77): Exception: java.lang.NoClassDefFoundError: 
> org/apache/beam/sdk/util/AttemptBoundedExponentialBackOff 
> com.google.cloud.dataflow.worker.runners.worker.WorkerCustomSources$UnboundedReaderIterator.advance(WorkerCustomSources.java:694)
>  
> com.google.cloud.dataflow.worker.util.common.worker.ReadOperation$SynchronizedReaderIterator.advance(ReadOperation.java:371)
>  
> com.google.cloud.dataflow.worker.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:198)
>  
> com.google.cloud.dataflow.worker.util.common.worker.ReadOperation.start(ReadOperation.java:139)
>  
> com.google.cloud.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:71)
>  
> com.google.cloud.dataflow.worker.runners.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:660)
>  
> com.google.cloud.dataflow.worker.runners.worker.StreamingDataflowWorker.access$500(StreamingDataflowWorker.java:89)
>  
> com.google.cloud.dataflow.worker.runners.worker.StreamingDataflowWorker$6.run(StreamingDataflowWorker.java:487)
>  
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> {code}
> Restore the AttemptBounded/AttemptAndTimeBounded backoff classes, which were 
> removed in this commit [1], so that Dataflow streaming jobs pass at HEAD.
> [1] 
> https://github.com/apache/incubator-beam/commit/dbbcbe604e167b306feac2443bec85f2da3c1dd6





[jira] [Created] (BEAM-624) NoClassDefFoundError Failed Dataflow Streaming Job

2016-09-09 Thread Mark Liu (JIRA)
Mark Liu created BEAM-624:
-

 Summary: NoClassDefFoundError Failed Dataflow Streaming Job
 Key: BEAM-624
 URL: https://issues.apache.org/jira/browse/BEAM-624
 Project: Beam
  Issue Type: Bug
Reporter: Mark Liu
Assignee: Mark Liu


NoClassDefFoundError when running a streaming job on the Dataflow service.

Full stack trace:
{code}
(bb04a33113307d77): Exception: java.lang.NoClassDefFoundError: 
org/apache/beam/sdk/util/AttemptBoundedExponentialBackOff 
com.google.cloud.dataflow.worker.runners.worker.WorkerCustomSources$UnboundedReaderIterator.advance(WorkerCustomSources.java:694)
 
com.google.cloud.dataflow.worker.util.common.worker.ReadOperation$SynchronizedReaderIterator.advance(ReadOperation.java:371)
 
com.google.cloud.dataflow.worker.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:198)
 
com.google.cloud.dataflow.worker.util.common.worker.ReadOperation.start(ReadOperation.java:139)
 
com.google.cloud.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:71)
 
com.google.cloud.dataflow.worker.runners.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:660)
 
com.google.cloud.dataflow.worker.runners.worker.StreamingDataflowWorker.access$500(StreamingDataflowWorker.java:89)
 
com.google.cloud.dataflow.worker.runners.worker.StreamingDataflowWorker$6.run(StreamingDataflowWorker.java:487)
 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
{code}

Restore the AttemptBounded/AttemptAndTimeBounded backoff classes, which were 
removed in this commit [1], so that Dataflow streaming jobs pass at HEAD.

[1] 
https://github.com/apache/incubator-beam/commit/dbbcbe604e167b306feac2443bec85f2da3c1dd6





[jira] [Updated] (BEAM-604) Use Watermark Check Streaming Job Finish in TestDataflowRunner

2016-09-02 Thread Mark Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Liu updated BEAM-604:
--
Description: 
Currently, streaming job with bounded input can't be terminated automatically 
and TestDataflowRunner can't handle this case. Need to update 
TestDataflowRunner so that streaming integration test such as 
WindowedWordCountIT can run with it.

Implementation:
Query watermark of each step and wait until all watermarks set to MAX then 
cancel the job.

Update:
Suggesting by [~pei...@gmail.com], implement checkMaxWatermark in 
DataflowPipelineJob#waitUntilFinish. Thus, all dataflow streaming jobs with 
bounded input will take advantage of this change and are canceled automatically 
when watermarks reach to max value. Also Dataflow runners can keep simple and 
free from handling batch and streaming two cases.

Update:
Pipeline author should have control on whether or not canceling streaming job 
and when. Test framework is a better place to auto-cancel streaming test job 
when curtain conditions meet, rather than in waitUntilFinish().
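The auto-cancel idea can be sketched like this (hypothetical names and watermark sentinel; the real logic would live in the Java test framework around TestDataflowRunner, not in waitUntilFinish):

```python
MAX_WATERMARK = 2 ** 63 - 1  # assumed sentinel: the step drained its bounded input


def cancel_when_drained(job, get_watermarks, poll=lambda: None):
    """Poll per-step watermarks and cancel the streaming job once every
    step reports the max watermark, i.e. all bounded input is consumed."""
    while True:
        watermarks = get_watermarks(job)
        if watermarks and all(w >= MAX_WATERMARK for w in watermarks):
            job.cancel()  # the framework, not waitUntilFinish(), decides to cancel
            return
        poll()  # e.g. sleep between watermark queries to the service
```

Keeping the cancellation in the test framework preserves the pipeline author's control: production code that calls waitUntilFinish() is unaffected.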


  was:
Currently, a streaming job with bounded input can't be terminated 
automatically, and TestDataflowRunner can't handle this case. We need to 
update TestDataflowRunner so that streaming integration tests such as 
WindowedWordCountIT can run with it.

Implementation:
Query the watermark of each step and wait until all watermarks are set to MAX, 
then cancel the job.

Update:
As suggested by [~pei...@gmail.com], implement checkMaxWatermark in 
DataflowPipelineJob#waitUntilFinish. Thus, all Dataflow streaming jobs with 
bounded input will take advantage of this change and be canceled automatically 
when watermarks reach the max value. Dataflow runners can also stay simple, 
free from handling the batch and streaming cases separately.

Update:
1. The pipeline author has control over whether and when to cancel a streaming 
job. The ideal way to do it is:
{code}
job = pipeline.run();
job.waitUntilFinish();
job.cancel();
{code}



> Use Watermark Check Streaming Job Finish in TestDataflowRunner
> --
>
> Key: BEAM-604
> URL: https://issues.apache.org/jira/browse/BEAM-604
> Project: Beam
>  Issue Type: Improvement
>Reporter: Mark Liu
>Assignee: Mark Liu
>Priority: Minor
>
> Currently, a streaming job with bounded input can't be terminated 
> automatically, and TestDataflowRunner can't handle this case. We need to 
> update TestDataflowRunner so that streaming integration tests such as 
> WindowedWordCountIT can run with it.
> Implementation:
> Query the watermark of each step and wait until all watermarks are set to 
> MAX, then cancel the job.
> Update:
> As suggested by [~pei...@gmail.com], implement checkMaxWatermark in 
> DataflowPipelineJob#waitUntilFinish. Thus, all Dataflow streaming jobs with 
> bounded input will take advantage of this change and be canceled 
> automatically when watermarks reach the max value. Dataflow runners can 
> also stay simple, free from handling the batch and streaming cases 
> separately.
> Update:
> The pipeline author should have control over whether and when to cancel a 
> streaming job. The test framework is a better place to auto-cancel a 
> streaming test job when certain conditions are met, rather than 
> waitUntilFinish().





[jira] [Updated] (BEAM-604) Use Watermark Check Streaming Job Finish in TestDataflowRunner

2016-09-02 Thread Mark Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Liu updated BEAM-604:
--
Description: 
Currently, streaming job with bounded input can't be terminated automatically 
and TestDataflowRunner can't handle this case. Need to update 
TestDataflowRunner so that streaming integration test such as 
WindowedWordCountIT can run with it.

Implementation:
Query watermark of each step and wait until all watermarks set to MAX then 
cancel the job.

Update:
Suggesting by [~pei...@gmail.com], implement checkMaxWatermark in 
DataflowPipelineJob#waitUntilFinish. Thus, all dataflow streaming jobs with 
bounded input will take advantage of this change and are canceled automatically 
when watermarks reach to max value. Also Dataflow runners can keep simple and 
free from handling batch and streaming two cases.

Update:idile 
1. pipeline author have control on whether or not canceling streaming job and 
when. The ideal way to do is:
{code}
job = pipeline.run();
job.waitUntilFinish();
job.cancel();
{code}
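The watermark check behind this proposal can be sketched as a polling loop. This is a hypothetical illustration only: the Job interface, the MAX_WATERMARK sentinel, and the polling interval are stand-ins for the real Dataflow job-metrics API, whose types and signatures differ.

```java
import java.util.List;

public class WatermarkCheck {
    // Illustrative sentinel for "watermark at MAX"; the real service
    // reports a concrete max timestamp, not Long.MAX_VALUE.
    static final long MAX_WATERMARK = Long.MAX_VALUE;

    /** Hypothetical view of a running streaming job. */
    interface Job {
        List<Long> stepWatermarks(); // one watermark per pipeline step
        void cancel();
    }

    /**
     * Poll the job's step watermarks; once every step has reached MAX
     * (all bounded input is fully processed), cancel the job. Returns
     * true if the job was canceled, false if maxPolls was exhausted.
     */
    static boolean cancelWhenDrained(Job job, int maxPolls) {
        for (int i = 0; i < maxPolls; i++) {
            boolean allAtMax = job.stepWatermarks().stream()
                    .allMatch(w -> w == MAX_WATERMARK);
            if (allAtMax) {
                job.cancel();
                return true;
            }
            try {
                Thread.sleep(10); // real code would poll far less often
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false;
    }
}
```

Per the final update on this issue, a loop like this belongs in the test framework rather than in waitUntilFinish(), so that pipeline authors keep control over cancellation.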


  was:
Currently, a streaming job with bounded input can't be terminated automatically, 
and TestDataflowRunner can't handle this case. We need to update 
TestDataflowRunner so that streaming integration tests such as 
WindowedWordCountIT can run with it.

Implementation:
Query the watermark of each step and wait until all watermarks reach MAX, then 
cancel the job.

Update:
As suggested by [~pei...@gmail.com], implement checkMaxWatermark in 
DataflowPipelineJob#waitUntilFinish. Thus, all Dataflow streaming jobs with 
bounded input will benefit from this change and are canceled automatically 
when watermarks reach the max value. Dataflow runners can also stay simple, 
free from handling batch and streaming as two separate cases.

Update:



> Use Watermark Check Streaming Job Finish in TestDataflowRunner
> --
>
> Key: BEAM-604
> URL: https://issues.apache.org/jira/browse/BEAM-604
> Project: Beam
>  Issue Type: Improvement
>Reporter: Mark Liu
>Assignee: Mark Liu
>Priority: Minor
>
> Currently, a streaming job with bounded input can't be terminated automatically, 
> and TestDataflowRunner can't handle this case. We need to update 
> TestDataflowRunner so that streaming integration tests such as 
> WindowedWordCountIT can run with it.
> Implementation:
> Query the watermark of each step and wait until all watermarks reach MAX, then 
> cancel the job.
> Update:
> As suggested by [~pei...@gmail.com], implement checkMaxWatermark in 
> DataflowPipelineJob#waitUntilFinish. Thus, all Dataflow streaming jobs with 
> bounded input will benefit from this change and are canceled 
> automatically when watermarks reach the max value. Dataflow runners can also 
> stay simple, free from handling batch and streaming as two separate cases.
> Update:
> The pipeline author should have control over whether and when to cancel a 
> streaming job. The ideal way to do this is:
> {code}
> job = pipeline.run();
> job.waitUntilFinish();
> job.cancel();
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (BEAM-604) Use Watermark Check Streaming Job Finish in TestDataflowRunner

2016-09-02 Thread Mark Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Liu updated BEAM-604:
--
Description: 
Currently, a streaming job with bounded input can't be terminated automatically, 
and TestDataflowRunner can't handle this case. We need to update 
TestDataflowRunner so that streaming integration tests such as 
WindowedWordCountIT can run with it.

Implementation:
Query the watermark of each step and wait until all watermarks reach MAX, then 
cancel the job.

Update:
As suggested by [~pei...@gmail.com], implement checkMaxWatermark in 
DataflowPipelineJob#waitUntilFinish. Thus, all Dataflow streaming jobs with 
bounded input will benefit from this change and are canceled automatically 
when watermarks reach the max value. Dataflow runners can also stay simple, 
free from handling batch and streaming as two separate cases.

Update:


  was:
Currently, a streaming job with bounded input can't be terminated automatically, 
and TestDataflowRunner can't handle this case. We need to update 
TestDataflowRunner so that streaming integration tests such as 
WindowedWordCountIT can run with it.

Implementation:
Query the watermark of each step and wait until all watermarks reach MAX, then 
cancel the job.

Update:
As suggested by [~pei...@gmail.com], implement checkMaxWatermark in 
DataflowPipelineJob#waitUntilFinish. Thus, all Dataflow streaming jobs with 
bounded input will benefit from this change and are canceled automatically 
when watermarks reach the max value. Dataflow runners can also stay simple, 
free from handling batch and streaming as two separate cases.


> Use Watermark Check Streaming Job Finish in TestDataflowRunner
> --
>
> Key: BEAM-604
> URL: https://issues.apache.org/jira/browse/BEAM-604
> Project: Beam
>  Issue Type: Improvement
>Reporter: Mark Liu
>Assignee: Mark Liu
>Priority: Minor
>
> Currently, a streaming job with bounded input can't be terminated automatically, 
> and TestDataflowRunner can't handle this case. We need to update 
> TestDataflowRunner so that streaming integration tests such as 
> WindowedWordCountIT can run with it.
> Implementation:
> Query the watermark of each step and wait until all watermarks reach MAX, then 
> cancel the job.
> Update:
> As suggested by [~pei...@gmail.com], implement checkMaxWatermark in 
> DataflowPipelineJob#waitUntilFinish. Thus, all Dataflow streaming jobs with 
> bounded input will benefit from this change and are canceled 
> automatically when watermarks reach the max value. Dataflow runners can also 
> stay simple, free from handling batch and streaming as two separate cases.
> Update:



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (BEAM-604) Use Watermark Check Streaming Job Finish in TestDataflowRunner

2016-09-02 Thread Mark Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Liu updated BEAM-604:
--
Summary: Use Watermark Check Streaming Job Finish in TestDataflowRunner  
(was: Use Watermark Check Streaming Job Finish in DataflowPipelineJob)

> Use Watermark Check Streaming Job Finish in TestDataflowRunner
> --
>
> Key: BEAM-604
> URL: https://issues.apache.org/jira/browse/BEAM-604
> Project: Beam
>  Issue Type: Improvement
>Reporter: Mark Liu
>Assignee: Mark Liu
>Priority: Minor
>
> Currently, a streaming job with bounded input can't be terminated automatically, 
> and TestDataflowRunner can't handle this case. We need to update 
> TestDataflowRunner so that streaming integration tests such as 
> WindowedWordCountIT can run with it.
> Implementation:
> Query the watermark of each step and wait until all watermarks reach MAX, then 
> cancel the job.
> Update:
> As suggested by [~pei...@gmail.com], implement checkMaxWatermark in 
> DataflowPipelineJob#waitUntilFinish. Thus, all Dataflow streaming jobs with 
> bounded input will benefit from this change and are canceled 
> automatically when watermarks reach the max value. Dataflow runners can also 
> stay simple, free from handling batch and streaming as two separate cases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (BEAM-604) Use Watermark Check Streaming Job Finish in DataflowPipelineJob

2016-08-31 Thread Mark Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Liu updated BEAM-604:
--
Description: 
Currently, a streaming job with bounded input can't be terminated automatically, 
and TestDataflowRunner can't handle this case. We need to update 
TestDataflowRunner so that streaming integration tests such as 
WindowedWordCountIT can run with it.

Implementation:
Query the watermark of each step and wait until all watermarks reach MAX, then 
cancel the job.

Update:
As suggested by [~pei...@gmail.com], implement checkMaxWatermark in 
DataflowPipelineJob#waitUntilFinish. Thus, all Dataflow streaming jobs with 
bounded input will benefit from this change and are canceled automatically 
when watermarks reach the max value. Dataflow runners can also stay simple, 
free from handling batch and streaming as two separate cases.

  was:
Currently, a streaming job with bounded input can't be terminated automatically, 
and TestDataflowRunner can't handle this case. We need to update 
TestDataflowRunner so that streaming integration tests such as 
WindowedWordCountIT can run with it.

Implementation:
Query the watermark of each step and wait until all watermarks reach MAX, then 
cancel the job.

Update:
As suggested by [~pei...@gmail.com], implement checkMaxWatermark in 
DataflowPipelineJob#waitUntilFinish. Thus, all Dataflow streaming jobs with 
bounded input will benefit from this change and are canceled automatically 
when watermarks reach the max value. Also 


> Use Watermark Check Streaming Job Finish in DataflowPipelineJob
> ---
>
> Key: BEAM-604
> URL: https://issues.apache.org/jira/browse/BEAM-604
> Project: Beam
>  Issue Type: Improvement
>Reporter: Mark Liu
>Assignee: Mark Liu
>Priority: Minor
>
> Currently, a streaming job with bounded input can't be terminated automatically, 
> and TestDataflowRunner can't handle this case. We need to update 
> TestDataflowRunner so that streaming integration tests such as 
> WindowedWordCountIT can run with it.
> Implementation:
> Query the watermark of each step and wait until all watermarks reach MAX, then 
> cancel the job.
> Update:
> As suggested by [~pei...@gmail.com], implement checkMaxWatermark in 
> DataflowPipelineJob#waitUntilFinish. Thus, all Dataflow streaming jobs with 
> bounded input will benefit from this change and are canceled 
> automatically when watermarks reach the max value. Dataflow runners can also 
> stay simple, free from handling batch and streaming as two separate cases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (BEAM-604) Use Watermark Check Streaming Job Finish in DataflowPipelineJob

2016-08-31 Thread Mark Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Liu updated BEAM-604:
--
Description: 
Currently, a streaming job with bounded input can't be terminated automatically, 
and TestDataflowRunner can't handle this case. We need to update 
TestDataflowRunner so that streaming integration tests such as 
WindowedWordCountIT can run with it.

Implementation:
Query the watermark of each step and wait until all watermarks reach MAX, then 
cancel the job.

Update:
As suggested by [~pei...@gmail.com], implement checkMaxWatermark in 
DataflowPipelineJob#waitUntilFinish. Thus, all Dataflow streaming jobs with 
bounded input will benefit from this change and are canceled automatically 
when watermarks reach the max value. Also 

  was:
Currently, a streaming job with bounded input can't be terminated automatically, 
and TestDataflowRunner can't handle this case. We need to update 
TestDataflowRunner so that streaming integration tests such as 
WindowedWordCountIT can run with it.

Implementation:
Query the watermark of each step and wait until all watermarks reach MAX, then 
cancel the job.


> Use Watermark Check Streaming Job Finish in DataflowPipelineJob
> ---
>
> Key: BEAM-604
> URL: https://issues.apache.org/jira/browse/BEAM-604
> Project: Beam
>  Issue Type: Improvement
>Reporter: Mark Liu
>Assignee: Mark Liu
>Priority: Minor
>
> Currently, a streaming job with bounded input can't be terminated automatically, 
> and TestDataflowRunner can't handle this case. We need to update 
> TestDataflowRunner so that streaming integration tests such as 
> WindowedWordCountIT can run with it.
> Implementation:
> Query the watermark of each step and wait until all watermarks reach MAX, then 
> cancel the job.
> Update:
> As suggested by [~pei...@gmail.com], implement checkMaxWatermark in 
> DataflowPipelineJob#waitUntilFinish. Thus, all Dataflow streaming jobs with 
> bounded input will benefit from this change and are canceled 
> automatically when watermarks reach the max value. Also 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (BEAM-604) Use Watermark Check Streaming Job Finish in DataflowPipelineJob

2016-08-31 Thread Mark Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Liu updated BEAM-604:
--
Summary: Use Watermark Check Streaming Job Finish in DataflowPipelineJob  
(was: Use Watermark Check Streaming Job Finish in TestDataflowRunner )

> Use Watermark Check Streaming Job Finish in DataflowPipelineJob
> ---
>
> Key: BEAM-604
> URL: https://issues.apache.org/jira/browse/BEAM-604
> Project: Beam
>  Issue Type: Improvement
>Reporter: Mark Liu
>Assignee: Mark Liu
>Priority: Minor
>
> Currently, a streaming job with bounded input can't be terminated automatically, 
> and TestDataflowRunner can't handle this case. We need to update 
> TestDataflowRunner so that streaming integration tests such as 
> WindowedWordCountIT can run with it.
> Implementation:
> Query the watermark of each step and wait until all watermarks reach MAX, then 
> cancel the job.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (BEAM-572) Remove Spark references in WordCount

2016-08-31 Thread Mark Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15452754#comment-15452754
 ] 

Mark Liu commented on BEAM-572:
---

PR is closed. This jira can be marked as resolved.

> Remove Spark references in WordCount
> 
>
> Key: BEAM-572
> URL: https://issues.apache.org/jira/browse/BEAM-572
> Project: Beam
>  Issue Type: Bug
>  Components: examples-java
>Reporter: Pei He
>Assignee: Mark Liu
>
> Examples should be runner agnostics.
> We don't want to have Spark references in
> https://github.com/apache/incubator-beam/blob/master/examples/java/src/main/java/org/apache/beam/examples/WordCount.java



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (BEAM-604) Use Watermark Check Streaming Job Finish in TestDataflowRunner

2016-08-30 Thread Mark Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15449381#comment-15449381
 ] 

Mark Liu commented on BEAM-604:
---

Right. [~pei...@gmail.com] may have some context on this.

> Use Watermark Check Streaming Job Finish in TestDataflowRunner 
> ---
>
> Key: BEAM-604
> URL: https://issues.apache.org/jira/browse/BEAM-604
> Project: Beam
>  Issue Type: Improvement
>Reporter: Mark Liu
>Assignee: Mark Liu
>Priority: Minor
>
> Currently, a streaming job with bounded input can't be terminated automatically, 
> and TestDataflowRunner can't handle this case. We need to update 
> TestDataflowRunner so that streaming integration tests such as 
> WindowedWordCountIT can run with it.
> Implementation:
> Query the watermark of each step and wait until all watermarks reach MAX, then 
> cancel the job.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (BEAM-605) Create BigQuery Verifier

2016-08-29 Thread Mark Liu (JIRA)
Mark Liu created BEAM-605:
-

 Summary: Create BigQuery Verifier
 Key: BEAM-605
 URL: https://issues.apache.org/jira/browse/BEAM-605
 Project: Beam
  Issue Type: Task
Reporter: Mark Liu


Create a BigQuery verifier that verifies the output of integration tests that 
use BigQuery as their output destination. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (BEAM-604) Use Watermark Check Streaming Job Finish in TestDataflowRunner

2016-08-29 Thread Mark Liu (JIRA)
Mark Liu created BEAM-604:
-

 Summary: Use Watermark Check Streaming Job Finish in 
TestDataflowRunner 
 Key: BEAM-604
 URL: https://issues.apache.org/jira/browse/BEAM-604
 Project: Beam
  Issue Type: Bug
Reporter: Mark Liu
Assignee: Mark Liu


Currently, a streaming job with bounded input can't be terminated automatically, 
and TestDataflowRunner can't handle this case. We need to update 
TestDataflowRunner so that streaming integration tests such as 
WindowedWordCountIT can run with it.

Implementation:
Query the watermark of each step and wait until all watermarks reach MAX, then 
cancel the job.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (BEAM-594) Support cancel() and waitUntilFinish() in FlinkRunnerResult

2016-08-29 Thread Mark Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15446359#comment-15446359
 ] 

Mark Liu commented on BEAM-594:
---

Duplicate of BEAM-593.

> Support cancel() and waitUntilFinish() in FlinkRunnerResult
> ---
>
> Key: BEAM-594
> URL: https://issues.apache.org/jira/browse/BEAM-594
> Project: Beam
>  Issue Type: New Feature
>  Components: runner-flink
>Reporter: Pei He
>
> We introduced both functions to PipelineResult.
> Currently, both of them throw UnsupportedOperationException in Flink runner.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (BEAM-592) StackOverflowError Failed example/java/WordCount When Using SparkRunner

2016-08-26 Thread Mark Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Liu updated BEAM-592:
--
Description: 
WordCount (example/java/WordCount) fails when run with the SparkRunner using the 
following command:
{code}
mvn clean compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount 
-Dexec.args="--inputFile=/tmp/kinglear.txt --runner=SparkRunner 
--sparkMaster=local --tempLocation=/tmp/out"
{code}
Part of the stack trace:
{code}
Caused by: java.lang.StackOverflowError
at 
java.util.concurrent.CopyOnWriteArrayList.toArray(CopyOnWriteArrayList.java:374)
at java.util.logging.Logger.accessCheckedHandlers(Logger.java:1782)
at 
java.util.logging.LogManager$RootLogger.accessCheckedHandlers(LogManager.java:1668)
at java.util.logging.Logger.getHandlers(Logger.java:1776)
at java.util.logging.Logger.log(Logger.java:735)
at org.slf4j.impl.JDK14LoggerAdapter.log(JDK14LoggerAdapter.java:580)
at org.slf4j.impl.JDK14LoggerAdapter.log(JDK14LoggerAdapter.java:650)
at 
org.slf4j.bridge.SLF4JBridgeHandler.callLocationAwareLogger(SLF4JBridgeHandler.java:224)
at 
org.slf4j.bridge.SLF4JBridgeHandler.publish(SLF4JBridgeHandler.java:301)
at java.util.logging.Logger.log(Logger.java:738)
at org.slf4j.impl.JDK14LoggerAdapter.log(JDK14LoggerAdapter.java:580)
at org.slf4j.impl.JDK14LoggerAdapter.log(JDK14LoggerAdapter.java:650)
at 
org.slf4j.bridge.SLF4JBridgeHandler.callLocationAwareLogger(SLF4JBridgeHandler.java:224)
at 
org.slf4j.bridge.SLF4JBridgeHandler.publish(SLF4JBridgeHandler.java:301)
...
{code}

According to [http://slf4j.org/legacy.html#jul-to-slf4j], and in particular the 
section stating that "jul-to-slf4j.jar and slf4j-jdk14.jar cannot be present 
simultaneously":

Changing the slf4j-jdk14 dependency scope to test solves the above problem, but 
WordCountIT still fails for the same reason. 
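The scope change described above might look like the following pom.xml fragment. This is a hypothetical sketch; the actual artifact coordinates match the org.slf4j group, but the version property name is illustrative.

```xml
<!-- Hypothetical fragment: keep slf4j-jdk14 off the runtime classpath
     so it cannot coexist with jul-to-slf4j. -->
<dependency>
  <groupId>org.slf4j</groupId>
  <artifactId>slf4j-jdk14</artifactId>
  <version>${slf4j.version}</version>
  <scope>test</scope>
</dependency>
```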

  was:
WordCount fails when run with the SparkRunner using the following command:
{code}
mvn clean compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount 
-Dexec.args="--inputFile=/tmp/kinglear.txt --runner=SparkRunner 
--sparkMaster=local --tempLocation=/tmp/out"
{code}
Part of the stack trace:
{code}
Caused by: java.lang.StackOverflowError
at 
java.util.concurrent.CopyOnWriteArrayList.toArray(CopyOnWriteArrayList.java:374)
at java.util.logging.Logger.accessCheckedHandlers(Logger.java:1782)
at 
java.util.logging.LogManager$RootLogger.accessCheckedHandlers(LogManager.java:1668)
at java.util.logging.Logger.getHandlers(Logger.java:1776)
at java.util.logging.Logger.log(Logger.java:735)
at org.slf4j.impl.JDK14LoggerAdapter.log(JDK14LoggerAdapter.java:580)
at org.slf4j.impl.JDK14LoggerAdapter.log(JDK14LoggerAdapter.java:650)
at 
org.slf4j.bridge.SLF4JBridgeHandler.callLocationAwareLogger(SLF4JBridgeHandler.java:224)
at 
org.slf4j.bridge.SLF4JBridgeHandler.publish(SLF4JBridgeHandler.java:301)
at java.util.logging.Logger.log(Logger.java:738)
at org.slf4j.impl.JDK14LoggerAdapter.log(JDK14LoggerAdapter.java:580)
at org.slf4j.impl.JDK14LoggerAdapter.log(JDK14LoggerAdapter.java:650)
at 
org.slf4j.bridge.SLF4JBridgeHandler.callLocationAwareLogger(SLF4JBridgeHandler.java:224)
at 
org.slf4j.bridge.SLF4JBridgeHandler.publish(SLF4JBridgeHandler.java:301)
...
{code}

According to [http://slf4j.org/legacy.html#jul-to-slf4j], and in particular the 
section stating that "jul-to-slf4j.jar and slf4j-jdk14.jar cannot be present 
simultaneously":

Changing the slf4j-jdk14 dependency scope to test solves the above problem, but 
WordCountIT still fails for the same reason. 


> StackOverflowError Failed example/java/WordCount When Using SparkRunner
> ---
>
> Key: BEAM-592
> URL: https://issues.apache.org/jira/browse/BEAM-592
> Project: Beam
>  Issue Type: Bug
>Reporter: Mark Liu
>
> WordCount (example/java/WordCount) fails when run with the SparkRunner using 
> the following command:
> {code}
> mvn clean compile exec:java 
> -Dexec.mainClass=org.apache.beam.examples.WordCount 
> -Dexec.args="--inputFile=/tmp/kinglear.txt --runner=SparkRunner 
> --sparkMaster=local --tempLocation=/tmp/out"
> {code}
> Part of the stack trace:
> {code}
> Caused by: java.lang.StackOverflowError
>   at 
> java.util.concurrent.CopyOnWriteArrayList.toArray(CopyOnWriteArrayList.java:374)
>   at java.util.logging.Logger.accessCheckedHandlers(Logger.java:1782)
>   at 
> java.util.logging.LogManager$RootLogger.accessCheckedHandlers(LogManager.java:1668)
>   at java.util.logging.Logger.getHandlers(Logger.java:1776)
>   at java.util.logging.Logger.log(Logger.java:735)
>   at org.slf4j.impl.JDK14LoggerAdapter.log(JDK14LoggerAdapter.java:580)
>   at 

[jira] [Commented] (BEAM-583) Auto Register TestDataflowRunner

2016-08-24 Thread Mark Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15435790#comment-15435790
 ] 

Mark Liu commented on BEAM-583:
---

Makes sense to me. 

To me, TestPipeline is used more for unit and functional tests, and people 
should really use it there. But once a pipeline is built and people want to test 
it end to end, TestYYYRunner + the IT framework is the better choice.

> Auto Register TestDataflowRunner 
> -
>
> Key: BEAM-583
> URL: https://issues.apache.org/jira/browse/BEAM-583
> Project: Beam
>  Issue Type: Improvement
>Reporter: Mark Liu
>Assignee: Mark Liu
>
> Register TestDataflowRunner automatically.
> Simplify the runner option when using TestDataflowRunner to run end-to-end 
> tests against the Dataflow service. Instead of using 
> {code}--runner=org.apache.beam.runners.dataflow.testing.TestDataflowRunner{code},
>  we can simply use {code}--runner=TestDataflowRunner{code}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (BEAM-583) Auto Register TestDataflowRunner

2016-08-24 Thread Mark Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15435760#comment-15435760
 ] 

Mark Liu commented on BEAM-583:
---

I can't think of a case where the pipeline author uses TestDataflowRunner 
directly. It is used to kick off an integration test like WordCountIT. So do we 
want to register only the runners that pipeline authors use directly?

> Auto Register TestDataflowRunner 
> -
>
> Key: BEAM-583
> URL: https://issues.apache.org/jira/browse/BEAM-583
> Project: Beam
>  Issue Type: Improvement
>Reporter: Mark Liu
>Assignee: Mark Liu
>
> Register TestDataflowRunner automatically.
> Simplify the runner option when using TestDataflowRunner to run end-to-end 
> tests against the Dataflow service. Instead of using 
> {code}--runner=org.apache.beam.runners.dataflow.testing.TestDataflowRunner{code},
>  we can simply use {code}--runner=TestDataflowRunner{code}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (BEAM-583) Auto Register TestDataflowRunner

2016-08-24 Thread Mark Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15435739#comment-15435739
 ] 

Mark Liu commented on BEAM-583:
---

Using {code}--runner=TestDataflowRunner{code} will cause the following error, 
which may confuse people.

{code}
java.lang.IllegalArgumentException: Unknown 'runner' specified 
'TestDataflowRunner', supported pipeline runners [BlockingDataflowRunner, 
DataflowRunner, DirectRunner, FlinkRunner, SparkRunner, TestFlinkRunner, 
TestSparkRunner]
{code}

> Auto Register TestDataflowRunner 
> -
>
> Key: BEAM-583
> URL: https://issues.apache.org/jira/browse/BEAM-583
> Project: Beam
>  Issue Type: Improvement
>Reporter: Mark Liu
>Assignee: Mark Liu
>
> Register TestDataflowRunner automatically.
> Simplify the runner option when using TestDataflowRunner to run end-to-end 
> tests against the Dataflow service. Instead of using 
> {code}--runner=org.apache.beam.runners.dataflow.testing.TestDataflowRunner{code},
>  we can simply use {code}--runner=TestDataflowRunner{code}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (BEAM-583) Auto Register TestDataflowRunner

2016-08-24 Thread Mark Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15435710#comment-15435710
 ] 

Mark Liu commented on BEAM-583:
---

I haven't figured out how to use --help, but according to the 
PipelineOptionsFactory code, only registered runners will be listed.

I intend to register TestDataflowRunner in DataflowPipelineRegistrar if that 
makes sense to you.
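The effect of such a registration can be illustrated with a self-contained sketch of short-name resolution. The map and resolve helper below are hypothetical: Beam actually discovers registrars at runtime via java.util.ServiceLoader and registrar implementations, rather than a hard-coded map.

```java
import java.util.Map;

public class RunnerRegistry {
    // Hypothetical mapping from short runner names to fully qualified
    // class names; in Beam this set is assembled from registrars
    // discovered on the classpath.
    static final Map<String, String> RUNNERS = Map.of(
        "DataflowRunner", "org.apache.beam.runners.dataflow.DataflowRunner",
        "TestDataflowRunner",
        "org.apache.beam.runners.dataflow.testing.TestDataflowRunner");

    /** Resolve a --runner flag value to a class name, accepting short names. */
    static String resolve(String flag) {
        if (RUNNERS.containsKey(flag)) {
            return RUNNERS.get(flag);
        }
        if (flag.contains(".")) {
            return flag; // already a fully qualified class name
        }
        throw new IllegalArgumentException("Unknown 'runner' specified '" + flag
            + "', supported pipeline runners " + RUNNERS.keySet());
    }
}
```

Registering TestDataflowRunner adds its short name to this set, which is what makes the shorter {code}--runner=TestDataflowRunner{code} flag resolve instead of raising the "Unknown 'runner'" error shown earlier.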

> Auto Register TestDataflowRunner 
> -
>
> Key: BEAM-583
> URL: https://issues.apache.org/jira/browse/BEAM-583
> Project: Beam
>  Issue Type: Improvement
>Reporter: Mark Liu
>Assignee: Mark Liu
>
> Register TestDataflowRunner automatically.
> Simplify option's arguments when using TestDataflowRunner run end-to-end test 
> against dataflow service. Instead of using 
> {code}--runner=org.apache.beam.runners.dataflow.testing.TestDataflowRunner{code},
>  we can also use {code}--runner=TestDataflowRunner{code}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (BEAM-583) Auto Register TestDataflowRunner

2016-08-24 Thread Mark Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Liu updated BEAM-583:
--
Description: 
Register TestDataflowRunner automatically.

Simplify the runner option when using TestDataflowRunner to run end-to-end 
tests against the Dataflow service. Instead of using 
{code}--runner=org.apache.beam.runners.dataflow.testing.TestDataflowRunner{code},
 we can simply use {code}--runner=TestDataflowRunner{code}.

  was:
Register TestDataflowRunner automatically.

Simplify the runner option when using TestDataflowRunner to run end-to-end 
tests against the Dataflow service. Instead of using 
"--runner=org.apache.beam.runners.dataflow.testing.TestDataflowRunner", we can 
simply use "--runner=TestDataflowRunner".


> Auto Register TestDataflowRunner 
> -
>
> Key: BEAM-583
> URL: https://issues.apache.org/jira/browse/BEAM-583
> Project: Beam
>  Issue Type: Improvement
>Reporter: Mark Liu
>Assignee: Mark Liu
>
> Register TestDataflowRunner automatically.
> Simplify the runner option when using TestDataflowRunner to run end-to-end 
> tests against the Dataflow service. Instead of using 
> {code}--runner=org.apache.beam.runners.dataflow.testing.TestDataflowRunner{code},
>  we can simply use {code}--runner=TestDataflowRunner{code}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (BEAM-583) Auto Register TestDataflowRunner

2016-08-24 Thread Mark Liu (JIRA)
Mark Liu created BEAM-583:
-

 Summary: Auto Register TestDataflowRunner 
 Key: BEAM-583
 URL: https://issues.apache.org/jira/browse/BEAM-583
 Project: Beam
  Issue Type: Improvement
Reporter: Mark Liu
Assignee: Mark Liu


Register TestDataflowRunner automatically.

Simplify the runner option when using TestDataflowRunner to run end-to-end 
tests against the Dataflow service. Instead of using 
"--runner=org.apache.beam.runners.dataflow.testing.TestDataflowRunner", we can 
simply use "--runner=TestDataflowRunner".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (BEAM-561) Add WindowedWordCountIT

2016-08-18 Thread Mark Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Liu reassigned BEAM-561:
-

Assignee: Mark Liu

> Add WindowedWordCountIT
> ---
>
> Key: BEAM-561
> URL: https://issues.apache.org/jira/browse/BEAM-561
> Project: Beam
>  Issue Type: Bug
>Reporter: Jason Kuster
>Assignee: Mark Liu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (BEAM-495) Generalize FileChecksumMatcher used for all E2E test

2016-07-27 Thread Mark Liu (JIRA)
Mark Liu created BEAM-495:
-

 Summary: Generalize FileChecksumMatcher used for all E2E test
 Key: BEAM-495
 URL: https://issues.apache.org/jira/browse/BEAM-495
 Project: Beam
  Issue Type: Improvement
Reporter: Mark Liu
Assignee: Mark Liu


Refactor WordCountOnSuccessMatcher to be more general so that it can be reused 
by other tests.

Requirement:
Given an input file path (globs accepted) and an expected checksum, generate 
the checksum of the file(s) and verify it against the expected value.
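The requirement above can be sketched as follows. The class and method names are illustrative only; Beam's actual FileChecksumMatcher is a Hamcrest matcher applied to a pipeline result, and its hashing scheme may differ from this one.

```java
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class FileChecksumMatcherSketch {
    /**
     * Compute one SHA-1 hex digest over the contents of every file in
     * dir matching glob, visiting files in sorted-path order so the
     * result is deterministic across runs.
     */
    static String checksum(Path dir, String glob) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-1");
            List<Path> matched = new ArrayList<>();
            try (DirectoryStream<Path> files = Files.newDirectoryStream(dir, glob)) {
                files.forEach(matched::add);
            }
            Collections.sort(matched);
            for (Path p : matched) {
                digest.update(Files.readAllBytes(p));
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    /** Verify the computed checksum against the expected one. */
    static boolean matches(Path dir, String glob, String expected) {
        return checksum(dir, glob).equals(expected);
    }
}
```

Sorting the matched paths matters: sharded output like out-00000-of-00003 must be hashed in a stable order or the checksum changes from run to run.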



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (BEAM-446) Improve IOChannelUtils.resolve() to accept multiple paths at once

2016-07-14 Thread Mark Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Liu closed BEAM-446.
-

> Improve IOChannelUtils.resolve() to accept multiple paths at once
> -
>
> Key: BEAM-446
> URL: https://issues.apache.org/jira/browse/BEAM-446
> Project: Beam
>  Issue Type: Improvement
>Reporter: Mark Liu
>Assignee: Mark Liu
>Priority: Minor
> Fix For: 0.2.0-incubating
>
>
> Currently, the IOChannelUtils.resolve() method can only resolve one path 
> against a base path. 
> It would be useful to have another method that takes one base path and 
> multiple others. The returned string is a path that starts with the base 
> path and appends the rest, separated by the file separator.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (BEAM-446) Improve IOChannelUtils.resolve() to accept multiple paths at once

2016-07-12 Thread Mark Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Liu updated BEAM-446:
--
Description: 
Currently, IOChannelUtils.resolve() method can only resolve one path against 
base path. 

It's useful to have another method with arguments that includes one base path 
and multiple others. The return string will be a directory that start with base 
path and append rests which are separated by file separator.
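The proposed overload might look like the sketch below. This is simplified string handling under stated assumptions (class and method names are illustrative); the real IOChannelUtils delegates to per-filesystem handlers, e.g. for gs:// paths.

```java
public class ResolveSketch {
    /**
     * Resolve several path components against a base path in one call,
     * joining them with '/' while avoiding doubled separators.
     */
    static String resolve(String base, String... others) {
        StringBuilder result = new StringBuilder(base);
        for (String other : others) {
            if (result.charAt(result.length() - 1) != '/') {
                result.append('/');
            }
            result.append(other);
        }
        return result.toString();
    }
}
```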

  was:
Currently, the IOChannelUtils.resolve() method can only resolve one path 
against a base path. 

It would be useful to have another method that takes one base path and multiple 
other paths. The returned string is a full path that starts with the base path 
and appends the rest, separated by the file separator.


> Improve IOChannelUtils.resolve() to accept multiple paths at once
> -
>
> Key: BEAM-446
> URL: https://issues.apache.org/jira/browse/BEAM-446
> Project: Beam
>  Issue Type: Improvement
>Reporter: Mark Liu
>Priority: Minor
>
> Currently, the IOChannelUtils.resolve() method can only resolve one path 
> against a base path. 
> It would be useful to have another method that takes one base path and 
> multiple others. The returned string is a path that starts with the base 
> path and appends the rest, separated by the file separator.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (BEAM-446) Improve IOChannelUtils.resolve() to accept multiple paths at once

2016-07-12 Thread Mark Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15373860#comment-15373860
 ] 

Mark Liu commented on BEAM-446:
---

[~lcwik], could you assign this to me?

> Improve IOChannelUtils.resolve() to accept multiple paths at once
> -
>
> Key: BEAM-446
> URL: https://issues.apache.org/jira/browse/BEAM-446
> Project: Beam
>  Issue Type: Improvement
>Reporter: Mark Liu
>Priority: Minor
>
> Currently, the IOChannelUtils.resolve() method can only resolve one path 
> against a base path. 
> It would be useful to have another method that takes one base path and 
> multiple other paths. The returned string is a full path that starts with 
> the base path and appends the rest, separated by the file separator.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (BEAM-445) Beam-examples-java build failed through local "mvn install"

2016-07-12 Thread Mark Liu (JIRA)
Mark Liu created BEAM-445:
-

 Summary: Beam-examples-java build failed through local "mvn 
install"
 Key: BEAM-445
 URL: https://issues.apache.org/jira/browse/BEAM-445
 Project: Beam
  Issue Type: Bug
 Environment: linux
Reporter: Mark Liu
Priority: Critical


Building the project under beam/examples/java with "mvn clean install 
-DskipTests" fails with the following error:

[ERROR] Failed to execute goal on project beam-examples-java: Could not resolve 
dependencies for project 
org.apache.beam:beam-examples-java:jar:0.2.0-incubating-SNAPSHOT: Could not 
transfer artifact 
io.netty:netty-tcnative-boringssl-static:jar:${os.detected.classifier}:1.1.33.Fork13
 from/to central (http://repo.maven.apache.org/maven2): Illegal character in 
path at index 138: 
http://repo.maven.apache.org/maven2/io/netty/netty-tcnative-boringssl-static/1.1.33.Fork13/netty-tcnative-boringssl-static-1.1.33.Fork13-${os.detected.classifier}.jar

Reason: ${os.detected.classifier} can't be resolved in the 
beam/sdks/java/io/google-cloud-platform pom file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)