Run a ParDo after a Partition

2017-10-03 Thread Tim Robertson
Hi folks, I feel a little daft asking this, and suspect I am missing the obvious... Can someone please tell me how I can do a ParDo following a Partition? In Spark I'd just repartition(...) and then a map() but I don't spot in the Beam API how to run a ParDo on each partition in parallel. Do I
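The pattern Beam offers here can be sketched as follows: Partition returns a PCollectionList, and a ParDo applied to each member processes the partitions in parallel. This is a minimal sketch, not from the thread itself; `input` is an existing PCollection of Strings and `MyDoFn` is a hypothetical DoFn.

```java
import org.apache.beam.sdk.transforms.Flatten;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.Partition;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;

// Split into 3 partitions by a hash of the element
PCollectionList<String> partitions =
    input.apply(Partition.of(3, (String element, int numPartitions) ->
        Math.abs(element.hashCode()) % numPartitions));

// Apply the ParDo to each partition; the resulting transforms run in parallel
PCollectionList<String> processed = PCollectionList.empty(input.getPipeline());
for (int i = 0; i < partitions.size(); i++) {
  processed = processed.and(
      partitions.get(i).apply("Process-" + i, ParDo.of(new MyDoFn())));
}

// Merge back into a single PCollection if a combined output is needed
PCollection<String> merged = processed.apply(Flatten.pCollections());
```

Unlike Spark's repartition-then-map, the per-partition transforms here are separate branches of the pipeline graph, which the runner schedules concurrently.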

Re: KafkaIO and Avro

2017-10-19 Thread Tim Robertson
implements Deserializer, > though I suppose with proper configuration that Object will at run-time be > your desired type. Have you tried adding some Java type casts to make it > compile? > > > +1, cast might be the simplest fix. Alternately you can wrap or > extend KafkaAvroDeser

Re: KafkaIO and Avro

2017-10-19 Thread Tim Robertson
es); > } > > @Override > public void close() {} > } > > Nicer than my solution so think that is the one I'm going to go with for > now. > > Thanks, > Andrew > > > On Thu, Oct 19, 2017, at 10:20 AM, Tim Robertson wrote: > > Hi Andrew, > > I also saw

Re: KafkaIO and Avro

2017-10-19 Thread Tim Robertson
O.<String, Envelope>read() > .withValueDeserializerAndCoder((Deserializer) > KafkaAvroDeserializer.class, > AvroCoder.of(Envelope.class)) > > On Thu, Oct 19, 2017 at 11:45 AM Tim Robertson <timrobertson...@gmail.com> > wrote: > >> Hi Raghu >> >> I tried that but with KafkaAvro

Re: ElasticSearch with RestHighLevelClient

2017-10-23 Thread Tim Robertson
Hi Ryan, I can confirm 2.2.0-SNAPSHOT works fine with an ES 5.6 cluster. I am told 2.2.0 is expected within a couple of weeks. My work is only a proof of concept for now, but I put in 300M fairly small docs at around 100,000/sec on a 3-node cluster without any issue [1]. Hope this helps, Tim [1]

Re: Does ElasticsearchIO in the latest RC support adding document IDs?

2017-11-15 Thread Tim Robertson
Hi Chet, I'll be a user of this, so thank you. It seems reasonable, although did you consider letting folks name the document ID field explicitly? It would avoid an unnecessary transformation and might be simpler: // instruct the writer to use a provided document ID
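For reference, the capability discussed in this thread later surfaced in ElasticsearchIO as withIdFn, which extracts the document ID from each JSON document. A sketch, with host, index, and field names purely illustrative:

```java
ElasticsearchIO.ConnectionConfiguration conn =
    ElasticsearchIO.ConnectionConfiguration.create(
        new String[] {"http://localhost:9200"}, "my-index", "_doc");

// withIdFn pulls the _id from a field of the document itself, so retried
// bulk writes overwrite the same document instead of creating duplicates.
docsAsJson.apply(
    ElasticsearchIO.write()
        .withConnectionConfiguration(conn)
        .withIdFn(node -> node.path("id").asText()));
```

Without an ID function, Elasticsearch assigns a fresh `_id` per bulk action, so runner-level retries can produce duplicate documents.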

Re: [DISCUSS] Drop Spark 1.x support to focus on Spark 2.x

2017-11-13 Thread Tim Robertson
Thanks JB On "thoughts": - Cloudera 5.13 will still default to 1.6 even though a 2.2 parcel is available (HWX provides both) - Cloudera support for Spark 2 has a list of exceptions ( https://www.cloudera.com/documentation/spark2/latest/topics/spark2_known_issues.html ) - I am not sure if the

Re: Does ElasticsearchIO in the latest RC support adding document IDs?

2017-11-15 Thread Tim Robertson
Hi Chet, +1 for interest in this from me too. If it helps, I'd have expected a) to be the implementation (e.g. something like "_id" being used if present) and handling multiple delivery being a responsibility of the developer. Thanks, Tim On Wed, Nov 15, 2017 at 10:08 AM, Jean-Baptiste

Re: KafkaIO and Avro

2017-10-19 Thread Tim Robertson
.withValueDeserializerAndCoder >> ((Class)KafkaAvroDeserializer.class, AvroCoder.of(Envelope.class)) >> >> >> On Thu, Oct 19, 2017 at 12:21 PM Raghu Angadi <rang...@google.com> wrote: >> >>> Same for me. It does not look like there is an annotation
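The invocation the thread converges on, assembled into one hedged sketch: KafkaAvroDeserializer implements Deserializer&lt;Object&gt;, so a raw-type cast is needed to satisfy KafkaIO's signature. The broker address and topic are placeholders; Envelope is the Avro-generated class from the thread.

```java
// KafkaAvroDeserializer implements Deserializer<Object>; the unchecked raw
// cast tells KafkaIO to treat it as a Deserializer<Envelope>, while
// AvroCoder supplies the coder Beam uses for downstream transforms.
@SuppressWarnings("unchecked")
KafkaIO.Read<String, Envelope> read =
    KafkaIO.<String, Envelope>read()
        .withBootstrapServers("broker:9092")      // placeholder
        .withTopic("envelope-topic")              // placeholder
        .withKeyDeserializer(StringDeserializer.class)
        .withValueDeserializerAndCoder(
            (Class) KafkaAvroDeserializer.class,
            AvroCoder.of(Envelope.class));
```

The cast is safe in practice because the Confluent deserializer returns the record type registered in the schema registry at runtime.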

Re: ElasticsearchIO bulk delete

2018-07-30 Thread Tim Robertson
Beam supports it, we > decided to postpone the feature (we have a fix that works for us, for now). > > When Beam supports ES6, I’ll be happy to make a contribution to get bulk > deletes working. > > > > For reference, I opened a ticket (https://issues.apache.org/ > jira/browse/BEA

Re: Pipeline error handling

2018-07-26 Thread Tim Robertson
Hi Kelsey Does the example [1] in the docs demonstrate differing generic types when using withOutputTags()? Could something like the following work for you? final TupleTag type1Records = final TupleTag type2Records = final TupleTag invalidRecords = // CSVInvalidLine holds e.g. an ID and
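The withOutputTags pattern being suggested, reconstructed as a sketch; Type1Record, Type2Record, CSVInvalidLine, and the isType1/parse helpers are hypothetical stand-ins for the poster's own types and parsing logic.

```java
final TupleTag<Type1Record> type1Records = new TupleTag<Type1Record>() {};
final TupleTag<Type2Record> type2Records = new TupleTag<Type2Record>() {};
final TupleTag<CSVInvalidLine> invalidRecords = new TupleTag<CSVInvalidLine>() {};

PCollectionTuple results =
    lines.apply(ParDo.of(new DoFn<String, Type1Record>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            try {
              if (isType1(c.element())) {
                c.output(parseType1(c.element()));             // main output
              } else {
                c.output(type2Records, parseType2(c.element()));
              }
            } catch (Exception e) {
              // CSVInvalidLine holds e.g. an ID and the raw line
              c.output(invalidRecords, new CSVInvalidLine(c.element()));
            }
          }
        }).withOutputTags(type1Records,
            TupleTagList.of(type2Records).and(invalidRecords)));

PCollection<CSVInvalidLine> invalid = results.get(invalidRecords);
```

The tags carry the generic types, so each PCollection pulled out of the PCollectionTuple is correctly typed even though they come from one DoFn.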

Re: ElasticsearchIO bulk delete

2018-07-27 Thread Tim Robertson
Hi Wout, This is great, thank you. I wrote the partial update support you reference and I'll be happy to mentor you through your first PR - welcome aboard. Can you please open a Jira to reference this work and we'll assign it to you? We discussed having the "_xxx" fields in the document and

Re: Regression of (BEAM-2277) - IllegalArgumentException when using Hadoop file system for WordCount example.

2018-08-20 Thread Tim Robertson
Hi Juan, You are correct that BEAM-2277 seems to be recurring. I stumbled upon it myself today in my own pipeline (not word count). I have just posted a workaround at the bottom of the issue and will reopen it. Thank you for reminding us of this, Tim On Mon, Aug 20, 2018 at 4:44

Re: Beam and Hive

2018-07-20 Thread Tim Robertson
I answered on SO, Kelsey. I believe you should be able to add this to explicitly declare the coder to use: p.getCoderRegistry() .registerCoderForClass(HCatRecord.class, WritableCoder.of(DefaultHCatRecord.class)); On Fri, Jul 20, 2018 at 5:05 PM, Kelsey RIDER wrote: > Hello, > > > > I wrote
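The one-liner from the reply, assembled with the imports it needs; a sketch that assumes the beam-sdks-java-io-hadoop-common and hive-hcatalog dependencies are on the classpath.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.hadoop.WritableCoder;
import org.apache.hive.hcatalog.data.DefaultHCatRecord;
import org.apache.hive.hcatalog.data.HCatRecord;

// HCatRecord is an interface, so Beam cannot infer a coder for it;
// register one explicitly, backed by the Writable-based DefaultHCatRecord.
Pipeline p = Pipeline.create(options);
p.getCoderRegistry()
    .registerCoderForClass(HCatRecord.class,
        WritableCoder.of(DefaultHCatRecord.class));
```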

Re: ElasticIO retry configuration exception

2018-10-11 Thread Tim Robertson
I took a super quick look at the code and I think Romain is correct. 1. On a retry scenario it calls handleRetry() 2. Within handleRetry() it gets the DefaultRetryPredicate and calls test(response) - this reads the response stream to JSON 3. When the retry is successful (no 429 code) the response

Re: No filesystem found for scheme hdfs - with the FlinkRunner

2018-10-26 Thread Tim Robertson
Hi Juan This sounds reminiscent of https://issues.apache.org/jira/browse/BEAM-2277 which we believed fixed in 2.7.0. What IO are you using to write your files and can you paste a snippet of your code please? On BEAM-2277 I posted a workaround for AvroIO (it might help you find a workaround too):
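A common cause of the "No filesystem found for scheme hdfs" error is that the pipeline options never carried a Hadoop Configuration, so Beam's HadoopFileSystem is not registered on the workers. A sketch of that configuration step, with the namenode address as a placeholder:

```java
// Carry a Hadoop Configuration in the pipeline options so Beam's
// HadoopFileSystem registers itself for the "hdfs" scheme on workers.
HadoopFileSystemOptions options =
    PipelineOptionsFactory.fromArgs(args).as(HadoopFileSystemOptions.class);

Configuration conf = new Configuration();
conf.set("fs.defaultFS", "hdfs://namenode:8020");   // placeholder address
options.setHdfsConfiguration(Collections.singletonList(conf));

Pipeline p = Pipeline.create(options);
```

When running on a cluster, HadoopFileSystemOptions can also pick the configuration up from HADOOP_CONF_DIR instead of setting it programmatically.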

Re: AvroIO - failure using direct runner with java.nio.file.FileAlreadyExistsException when moving from temp to destination

2018-09-26 Thread Tim Robertson
Hi Juan Well done for diagnosing your issue and thank you for taking the time to report it here. I'm not the author of this section but I've taken a quick look at the code and inline comments, and have some observations that I think might help explain it. I notice it writes into temporary

Re: Moving to spark 2.4

2018-12-07 Thread Tim Robertson
To clarify Ismaël's comment, the Cloudera repo indicates Cloudera 6.1 will have Spark 2.4 but CDH is currently still on 6.0. https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/spark/spark-core_2.11/2.4.0-cdh6.1.0/ With the HWX / Cloudera merger the release cycle is not announced

Re: [ANNOUNCEMENT] Common Pipeline Patterns - new section in the documentation + contributions welcome

2019-06-07 Thread Tim Robertson
This is great. Thanks Pablo and all. I've seen several folks struggle with writing Avro to dynamic locations, which I think might be a good addition. If you agree I'll offer a PR unless someone gets there first - I have an example here:
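For the Avro-to-dynamic-locations pattern mentioned, a sketch using FileIO.writeDynamic; the keying field, schema variable, and output path are illustrative, not taken from the linked example.

```java
// Route each GenericRecord to an output directory derived from one of its
// fields, writing Avro files per destination.
records.apply(FileIO.<String, GenericRecord>writeDynamic()
    .by(r -> r.get("datasetId").toString())   // illustrative key field
    .via(AvroIO.sink(schema))
    .to("hdfs:///data/export")
    .withDestinationCoder(StringUtf8Coder.of())
    .withNaming(key -> FileIO.Write.defaultNaming(key + "/part", ".avro")));
```

The `.by(...)` function picks the destination per element, and `.withNaming(...)` maps each destination key to its file-naming scheme.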

Re: gRPC method to get a pipeline definition?

2019-06-26 Thread Tim Robertson
Another +1 to support your research into this Chad. Thank you. Trying to understand where a Beam process is in the Spark DAG is... not easy. A UI that helped would be a great addition. On Wed, Jun 26, 2019 at 3:30 PM Ismaël Mejía wrote: > +1 don't hesitate to create a JIRA + PR. You may be

Re: Large public Beam projects?

2020-04-21 Thread Tim Robertson
Hi Jordan I don't know if we qualify as a large Beam project but at GBIF.org we bring together datasets from 1600+ institutions documenting 1.4B observations of species (museum data, citizen science, environmental reports, etc.). As far as Beam goes though, we aren't using the most advanced

Re: Large public Beam projects?

2020-04-21 Thread Tim Robertson
My apologies, I missed the link: [1] https://github.com/gbif/pipelines On Tue, Apr 21, 2020 at 5:58 PM Tim Robertson wrote: > Hi Jordan > > I don't know if we qualify as a large Beam project but at GBIF.org we > bring together datasets from 1600+ institutions documenting 1.4B