Oh, and I forgot the unified model! For unbounded collections, you can't ask for the whole collection, since you'd be waiting forever ;-) Thomas has some work in progress on making PAssert support per-window assertions, though, so it will be able to handle that case in the future.
On Thu, May 19, 2016 at 11:26 AM, Frances Perry <[email protected]> wrote:

> Nope. A Beam pipeline goes through three different phases:
> (1) Graph construction -- from Pipeline.create() until Pipeline.run().
> During this phase PCollections are just placeholders -- their contents
> don't exist.
> (2) Submission -- Pipeline.run() submits the pipeline to a runner for
> execution.
> (3) Execution -- The runner runs the pipeline. May continue after
> Pipeline.run() returns, depending on the runner. Some PCollections may get
> materialized during execution, but others may not (for example, if the
> runner chooses to fuse or optimize them away).
>
> So in other words, there's no point in your main program where a
> PCollection is guaranteed to have its contents available.
>
> If you are attempting to debug a job, you might look into PAssert [1],
> which gives you a place to inspect the entire contents of a PCollection
> and do something with it. In the DirectPipelineRunner that code will run
> locally, so you'll see the output. But of course, in other runners it may
> unhelpfully print on some random worker somewhere. But you can also write
> tests that assert on its contents, which will work on any runner. (This is
> how the @RunnableOnService tests that we are setting up for all runners
> work.)
>
> We should get some documentation on testing put together on the Beam
> site, but in the meantime the info on the Dataflow site [2] roughly
> applies (DataflowAssert was renamed PAssert in Beam).
>
> Hope that helps!
>
> [1] https://github.com/apache/incubator-beam/blob/eb682a80c4091dce5242ebc8272f43f3461a6fc5/sdks/java/core/src/main/java/org/apache/beam/sdk/testing/PAssert.java
> [2] https://cloud.google.com/dataflow/pipelines/testing-your-pipeline
>
> On Wed, May 18, 2016 at 10:48 AM, Jesse Anderson <[email protected]> wrote:
>
>> Is there a way to get the PCollection's results in the executing
>> process? In Spark this is done using the collect method on an RDD.
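[A minimal sketch of the PAssert pattern described above, assuming the Beam Java SDK (beam-sdks-java-core) and a runner such as the direct runner are on the classpath. The class name PAssertExample and the input values are illustrative; newer SDKs use run().waitUntilFinish(), while the 2016-era DirectPipelineRunner blocked in run() itself.]

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.testing.PAssert;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;

public class PAssertExample {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.create());

    // Three elements, so Count.globally() should produce a single 3L.
    PCollection<String> words = p.apply(Create.of("a", "b", "b"));
    PCollection<Long> count = words.apply(Count.globally());

    // PAssert is itself a transform: the check runs wherever the runner
    // executes it, and a mismatch fails the pipeline on any runner.
    PAssert.that(count).containsInAnyOrder(3L);

    p.run().waitUntilFinish();
    System.out.println("pipeline assertions passed");
  }
}
```

Because the assertion travels with the pipeline, the same test works unchanged on the direct runner and on distributed runners, which is what makes it suitable for the @RunnableOnService-style suites mentioned above.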
>>
>> This would be for small amounts of data stored in a PCollection, like:
>>
>> PCollection<Long> count = partition.apply(Count.globally());
>> System.out.println(valuesincount);
>>
>> Thanks,
>>
>> Jesse
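[For completeness, a sketch of the closest analogue to Spark's collect() for a small PCollection: a ParDo that prints each element. This is an assumption-laden illustration, not an official Beam idiom -- on the direct runner the println happens in the local process, but on a distributed runner it would print on some worker, exactly the caveat raised above. Class and variable names are made up for the example.]

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

public class PrintPCollection {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.create());

    // Stand-in for the "partition" collection in the question.
    PCollection<Long> count =
        p.apply(Create.of("x", "y", "z")).apply(Count.globally());

    // Side-effecting DoFn: prints each element where it executes.
    count.apply(ParDo.of(new DoFn<Long, Void>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        System.out.println("count = " + c.element());
      }
    }));

    p.run().waitUntilFinish();
  }
}
```

Note that this only observes elements during execution (phase 3 above); there is still no way to pull the contents back into the main program the way Spark's collect() does.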
