Re: Writing Out List

2016-05-24 Thread Kenneth Knowles
Hi Jesse, Yes, within a PCollection the lists keep their internal order - they are "just values" from the perspective of Beam. So the output from Top is sorted and will remain sorted; there is just no ordering between the lists. If you want to assemble sorted output by joining together such

Re: Writing Out List

2016-05-24 Thread Jesse Anderson
Is my understanding of the ordering guarantees for Top and List correct? On Fri, May 20, 2016, 6:48 PM Jesse Anderson wrote: > Here's the output I'm looking for (and getting): > 2016-01-11T23:59:59.998Z low 682 >

Re: Parallelism

2016-05-24 Thread Jean-Baptiste Onofré
I second Thomas: thanks for the detailed explanation (I forgot to mention the "unique" JVM ;)). Regards JB On 05/24/2016 07:28 PM, Thomas Groh wrote: More specifically, the InProcessPipelineRunner (soon to be renamed to the DirectRunner) will run on a single machine, with a number of threads

Re: Parallelism

2016-05-24 Thread Thomas Groh
More specifically, the InProcessPipelineRunner (soon to be renamed to the DirectRunner) will run on a single machine, with a number of threads based on the number of available processors in the JVM, fanning out work to these threads as appropriate; it will not perform any cross-process (including
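
For readers unfamiliar with the pattern Thomas describes (a single JVM, a thread count derived from the available processors, work fanned out to those threads), here is a minimal plain-Java sketch. It is not Beam code and does not reflect how the InProcessPipelineRunner is actually implemented; the class name LocalFanOutSketch and the helper squareAll are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration only; not part of the Beam SDK.
public class LocalFanOutSketch {

    // Squares each input on a thread pool sized to the JVM's available processors.
    static List<Integer> squareAll(List<Integer> inputs) throws Exception {
        // Thread count follows the number of processors visible to this JVM.
        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // One task per element; the pool decides which thread runs which task.
            List<Callable<Integer>> tasks = new ArrayList<>();
            for (Integer x : inputs) {
                tasks.add(() -> x * x);
            }
            // invokeAll returns futures in the same order as the submitted tasks.
            List<Integer> results = new ArrayList<>();
            for (Future<Integer> future : pool.invokeAll(tasks)) {
                results.add(future.get());
            }
            return results;
        } finally {
            // Everything stays inside this one process; shut the pool down when done.
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(squareAll(Arrays.asList(1, 2, 3, 4))); // prints [1, 4, 9, 16]
    }
}
```

Nothing in this sketch leaves the process, which matches the single-machine behaviour described in the message above.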

Re: Generating a historically-consistent join

2016-05-24 Thread Mark Shields
Hi Ryan, perhaps this is https://issues.apache.org/jira/browse/BEAM-197 ? On Mon, May 23, 2016 at 6:47 PM, Ryan Madsen wrote: > Hi all, > > I'm looking to solve a problem related to performing a join on two > streaming datasets, and am having a hard time figuring out if

Re: Parallelism

2016-05-24 Thread Jean-Baptiste Onofré
Hi David, if you use the InProcessPipelineRunner (the "new" DirectPipelineRunner), then it can create several threads. Regards JB On 05/24/2016 04:38 PM, David Olsen wrote: A naive question about DirectPipelineRunner: Is it possible to execute DirectPipelineRunner with multiple threads/

Parallelism

2016-05-24 Thread David Olsen
A naive question about DirectPipelineRunner: Is it possible to execute DirectPipelineRunner with multiple threads/instances (across machines), or is parallelism only supported by runners such as SparkPipelineRunner? My requirement is to run the pipeline in parallel, either threading or multiple

Re: expected a valid 'gs://' path but was given '/tmp/tmpLocation'

2016-05-24 Thread Davor Bonaci
Yes -- MinimalWordCount example currently defaults to the DataflowPipelineRunner, which runs pipelines on the Google Cloud Dataflow service. (We'll be changing this.) In general, Cloud-based runners don't have access to your local machine, hence the exception you saw. DirectPipelineRunner can

Re: expected a valid 'gs://' path but was given '/tmp/tmpLocation'

2016-05-24 Thread Robertson Williams
Just found out what went wrong. Changing to use org.apache.beam.sdk.options.DirectPipelineOptions and org.apache.beam.sdk.runners.DirectPipelineRunner fixes the problem. Thanks On Tue, May 24, 2016 at 6:24 PM, Robertson Williams wrote: > I tried with the latest version

expected a valid 'gs://' path but was given '/tmp/tmpLocation'

2016-05-24 Thread Robertson Williams
I tried with the latest version, 0.1.0-SNAPSHOT, cloned from git, but when testing with MinimalWordCount, it throws: expected a valid 'gs://' path but was given '/tmp/tmpLocation'. Can I run the MinimalWordCount example locally (by supplying a tmp location on the local file system, e.g. file://) or is it