Using FileIO to read a single input file

2021-12-13 Thread Randal Moore
I have some side inputs that I would like to add to my pipeline. Some of them are based on a file pattern, so I found that I can collect the contents of those files using a pattern like the following: val genotypes = p.apply(FileIO.`match`().filepattern(opts.getGenotypesFilePattern()))
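For reference, matching a pattern and then reading each matched file's contents can be sketched with FileIO.match plus FileIO.readMatches in the Java SDK. A minimal sketch, assuming the Beam SDK is on the classpath; the file pattern below is a placeholder (the poster's opts.getGenotypesFilePattern() is their own options accessor):

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

public class ReadMatchedFiles {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Expand the glob into file metadata, open each match, and read it whole.
    PCollection<String> contents =
        p.apply(FileIO.match().filepattern("/data/genotypes/*.txt"))
         .apply(FileIO.readMatches())
         .apply(MapElements.into(TypeDescriptors.strings())
                           .via((FileIO.ReadableFile f) -> {
                             try {
                               return f.readFullyAsUTF8String();
                             } catch (java.io.IOException e) {
                               throw new RuntimeException(e);
                             }
                           }));

    p.run().waitUntilFinish();
  }
}
```

Reading each file fully is fine for small side inputs; for large files, a splittable read (e.g. TextIO) is the better fit.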

Re: Strange 'gzip error' running Beam on Dataflow

2018-10-12 Thread Randal Moore
csv) to > the output files and, as a consequence, GZIP files were automatically > decompressed when downloading them (as explained in the previous link). > > Best, > > > On Fri, 12 Oct 2018 at 23:40, Randal Moore wrote: > >> Using Beam Java SDK 2.6. >> >> I h

Strange 'gzip error' running Beam on Dataflow

2018-10-12 Thread Randal Moore
Using Beam Java SDK 2.6. I have a batch pipeline that has run successfully in its current form several times. Suddenly I am getting strange errors complaining about the format of the input. As far as I know, the pipeline didn't change at all since the last successful run. The error:

Status type mismatch between different API calls

2018-02-02 Thread Randal Moore
I'm using Dataflow. Found what seems to me to be a "usage problem" with the available APIs. When I submit a job to the Dataflow runner, I get back a DataflowPipelineJob or its superclass, PipelineResult, which provides me the status of the job as an enumerated type. But if I use a DataflowClient to
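The mismatch described here is that PipelineResult.getState() returns an enum, while the raw Dataflow service reports job states as strings such as "JOB_STATE_RUNNING". Bridging the two is a small mapping exercise; a plain-Java sketch (the enum constants and state strings below mirror commonly documented Dataflow states, but treat the exact lists as assumptions to verify against the API reference):

```java
import java.util.Map;

public class JobStateMapper {
    // Mirrors the shape of Beam's PipelineResult.State (assumed subset).
    enum State { UNKNOWN, RUNNING, DONE, FAILED, CANCELLED }

    // Raw state strings as the Dataflow service reports them (assumed subset).
    private static final Map<String, State> RAW_TO_ENUM = Map.of(
        "JOB_STATE_RUNNING",   State.RUNNING,
        "JOB_STATE_DONE",      State.DONE,
        "JOB_STATE_FAILED",    State.FAILED,
        "JOB_STATE_CANCELLED", State.CANCELLED);

    // Unrecognized strings fall back to UNKNOWN rather than throwing.
    static State fromRaw(String raw) {
        return RAW_TO_ENUM.getOrDefault(raw, State.UNKNOWN);
    }

    public static void main(String[] args) {
        System.out.println(fromRaw("JOB_STATE_DONE"));
        System.out.println(fromRaw("JOB_STATE_SOMETHING_NEW"));
    }
}
```

Beam itself does a mapping of this kind internally when it turns service responses into PipelineResult.State values.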

Re: [VOTE] [DISCUSSION] Remove support for Java 7

2017-10-17 Thread Randal Moore
+1 On Tue, Oct 17, 2017 at 5:21 PM Raghu Angadi wrote: > +1. > > On Tue, Oct 17, 2017 at 2:11 PM, David McNeill > wrote: > >> The final version of Beam that supports Java 7 should be clearly stated >> in the docs, so those stuck on old production

Strange errors running on DataFlow

2017-08-03 Thread Randal Moore
I have a batch pipeline that runs well with small inputs but fails with a larger dataset. Looking at Stackdriver, I find a fair number of the following: Request failed with code 400, will NOT retry:

Re: API to query the state of a running dataflow job?

2017-07-10 Thread Randal Moore
beam/blob/master/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowPipelineJob.java#L441 > [3] https://issues.apache.org/jira/secure/CreateIssue!default.jspa > > On Sun, Jul 9, 2017 at 2:54 PM, Randal Moore <rdmoor...@gmail.com> wrote:

API to query the state of a running dataflow job?

2017-07-09 Thread Randal Moore
Is this part of the Beam API, or something I should look at the Google docs for help with? Assume a job is running in Dataflow - how can an interested third-party app query the status if it knows the job-id? rdm
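For a third-party app that only has the job id, one route is the Dataflow REST API itself: a GET on the jobs endpoint by project, location, and job id returns job metadata including its current state. A plain-Java sketch; the v1b3 path is my recollection of the endpoint and should be verified against the current Dataflow API reference, and the project/job values are placeholders:

```java
import java.net.URI;

public class DataflowJobStatusUrl {
    // Builds the REST URL for fetching one job's metadata (including its state).
    // The v1b3 path is an assumption; check the Dataflow API reference.
    static URI jobUrl(String project, String location, String jobId) {
        return URI.create(String.format(
            "https://dataflow.googleapis.com/v1b3/projects/%s/locations/%s/jobs/%s",
            project, location, jobId));
    }

    public static void main(String[] args) {
        URI url = jobUrl("my-project", "us-central1", "some-job-id");
        System.out.println(url);
        // With an OAuth bearer token (e.g. from an application default
        // credential) you would issue a GET against this URL and read the
        // job-state field from the JSON response. Omitted here because it
        // needs real credentials and network access.
    }
}
```

The published Google API client libraries wrap this same endpoint if you would rather not hand-roll the HTTP call.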

Re: Providing HTTP client to DoFn

2017-07-05 Thread Randal Moore
n be shared across multiple bundles >> and can contain methods marked with @Setup/@Teardown which only get invoked >> once per DoFn instance (which is relatively infrequently) and you could >> store an instance per DoFn instead of a singleton if the REST library was >> not thread s
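The advice quoted above (create the client once per DoFn instance rather than serializing it) rests on a plain Java idiom: keep the non-serializable client in a transient field and create it lazily after deserialization. A stdlib-only sketch of that idiom, with HypotheticalRestClient standing in for whatever client library is in use; in Beam the lazy creation would live in a @Setup method instead:

```java
import java.io.*;

public class ClientHoldingFn implements Serializable {
    // Stand-in for a non-serializable REST client (hypothetical).
    static class HypotheticalRestClient {
        String fetch(String key) { return "data-for-" + key; }
    }

    // transient: skipped during serialization, so the runner never ships it.
    private transient HypotheticalRestClient client;

    // Lazily (re)create the client; Beam's @Setup hook plays this role,
    // running once per DoFn instance.
    private HypotheticalRestClient client() {
        if (client == null) {
            client = new HypotheticalRestClient();
        }
        return client;
    }

    public String process(String key) {
        return client().fetch(key);
    }

    // Round-trips an instance through serialization, as a runner would
    // when shipping a DoFn to a worker.
    static ClientHoldingFn roundTrip(ClientHoldingFn fn) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(bos);
        oos.writeObject(fn);
        oos.close();
        ObjectInputStream in =
            new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()));
        return (ClientHoldingFn) in.readObject();
    }

    public static void main(String[] args) throws Exception {
        ClientHoldingFn fn = new ClientHoldingFn();
        fn.process("warmup");                    // client created on first use
        ClientHoldingFn shipped = roundTrip(fn); // transient field arrives null
        System.out.println(shipped.process("abc"));
    }
}
```

If the client is not thread-safe, an instance per DoFn (as here) is the safe choice; if it is thread-safe, a shared singleton can save connections.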

Providing HTTP client to DoFn

2017-07-05 Thread Randal Moore
I have a step in my Beam pipeline that needs some data from a REST service. The data acquired from the REST service depends on the context of the data being processed and is relatively large. The REST client I am using isn't serializable - nor is it likely possible to make it so (background

Is this a valid usecase for Apache Beam and Google dataflow

2017-06-20 Thread Randal Moore
Just started looking at Beam this week as a candidate for executing some fairly CPU-intensive work. I am curious whether the stream-oriented features of Beam are a match for my use case. My user will submit a large number of computations to the system (as a "job"). Those computations can be expressed