Hi folks,
I feel a little daft asking this, and suspect I am missing the obvious...
Can someone please tell me how I can do a ParDo following a Partition?
In Spark I'd just repartition(...) and then map(), but I don't see in the
Beam API how to run a ParDo on each partition in parallel. Do I
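For what it's worth, one way this can look in the Beam Java SDK (a sketch only; the element type, partition count, and partition function are assumptions): Partition yields a PCollectionList, and a ParDo can be applied to each member — noting that the runner already parallelises any single ParDo over its bundles, so Partition is only needed when the partitions require different processing.

```java
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.Partition;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;

// Partition returns a PCollectionList<T>; apply a ParDo to each member.
PCollectionList<String> partitions =
    input.apply(Partition.of(3, (String element, int numPartitions) ->
        Math.abs(element.hashCode()) % numPartitions));

for (PCollection<String> partition : partitions.getAll()) {
  partition.apply(ParDo.of(new DoFn<String, String>() {
    @ProcessElement
    public void processElement(ProcessContext c) {
      c.output(c.element().toUpperCase());
    }
  }));
}
```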
implements Deserializer,
> though I suppose with proper configuration that Object will at run-time be
> your desired type. Have you tried adding some Java type casts to make it
> compile?
>
>
> +1, cast might be the simplest fix. Alternatively you can wrap or
> extend KafkaAvroDeser
es);
> }
>
> @Override
> public void close() {}
> }
>
> Nicer than my solution so think that is the one I'm going to go with for
> now.
>
> Thanks,
> Andrew
>
>
> On Thu, Oct 19, 2017, at 10:20 AM, Tim Robertson wrote:
>
> Hi Andrew,
>
> I also saw
O.<String, Envelope>read()
> .withValueDeserializerAndCoder((Deserializer)
> KafkaAvroDeserializer.class,
> AvroCoder.of(Envelope.class))
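Filled out, the cast-based workaround discussed in this thread looks roughly like the following (the broker address and topic name are placeholders; Envelope is the generated Avro class from the thread):

```java
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.coders.AvroCoder;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.kafka.common.serialization.StringDeserializer;
import io.confluent.kafka.serializers.KafkaAvroDeserializer;

// KafkaAvroDeserializer implements Deserializer<Object>, so a raw-typed
// cast is needed to satisfy the compiler; with schema-registry config it
// yields Envelope instances at runtime.
@SuppressWarnings({"unchecked", "rawtypes"})
KafkaIO.Read<String, Envelope> read =
    KafkaIO.<String, Envelope>read()
        .withBootstrapServers("kafka:9092")   // placeholder
        .withTopic("envelopes")               // placeholder
        .withKeyDeserializer(StringDeserializer.class)
        .withValueDeserializerAndCoder(
            (Class) KafkaAvroDeserializer.class,
            AvroCoder.of(Envelope.class));
```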
>
> On Thu, Oct 19, 2017 at 11:45 AM Tim Robertson <timrobertson...@gmail.com>
> wrote:
>
>> Hi Raghu
>>
>> I tried that but with KafkaAvro
Hi Ryan,
I can confirm 2.2.0-SNAPSHOT works fine with an ES 5.6 cluster. I am told
2.2.0 is expected within a couple weeks.
My work is only a proof of concept for now, but I put in 300M fairly small
docs at around 100,000/sec on a 3 node cluster without any issue [1].
Hope this helps,
Tim
[1]
Hi Chet,
I'll be a user of this, so thank you.
It seems reasonable although - did you consider letting folk name the
document ID field explicitly? It would avoid an unnecessary transformation
and might be simpler:
// instruct the writer to use a provided document ID
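Something along these lines, perhaps (a sketch only; the withIdFn name and the "_id" field used here are assumptions on my part, not a settled API):

```java
// Hypothetical: let the caller extract the document ID from each record,
// avoiding an extra transform to restructure documents before the write.
ElasticsearchIO.write()
    .withConnectionConfiguration(connectionConfiguration)
    .withIdFn(node -> node.get("_id").asText());
```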
Thanks JB
On "thoughts":
- Cloudera 5.13 will still default to 1.6 even though a 2.2 parcel is
available (HWX provides both)
- Cloudera support for Spark 2 has a list of exceptions (
https://www.cloudera.com/documentation/spark2/latest/topics/spark2_known_issues.html
)
- I am not sure if the
Hi Chet,
+1 for interest in this from me too.
If it helps, I'd have expected a) to be the implementation (e.g. something
like "_id" being used if present) and handling multiple delivery being a
responsibility of the developer.
Thanks,
Tim
On Wed, Nov 15, 2017 at 10:08 AM, Jean-Baptiste
.withValueDeserializerAndCoder
>> ((Class)KafkaAvroDeserializer.class, AvroCoder.of(Envelope.class))
>>
>>
>> On Thu, Oct 19, 2017 at 12:21 PM Raghu Angadi <rang...@google.com> wrote:
>>
>>> Same for me. It does not look like there is an annotation
Beam supports it, we
> decided to postpone the feature (we have a fix that works for us, for now).
>
> When Beam supports ES6, I’ll be happy to make a contribution to get bulk
> deletes working.
>
>
>
> For reference, I opened a ticket (https://issues.apache.org/
> jira/browse/BEA
Hi Kelsey
Does the example [1] in the docs demonstrate differing generic types when
using withOutputTags()?
Could something like the following work for you?
final TupleTag type1Records =
final TupleTag type2Records =
final TupleTag invalidRecords = // CSVInvalidLine holds
e.g. an ID and
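Filled out, that pattern might look like the following (Type1, Type2, and the shape of CSVInvalidLine are assumptions based on the thread):

```java
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.values.TupleTagList;

// Anonymous subclasses preserve the generic type for coder inference.
final TupleTag<Type1> type1Records = new TupleTag<Type1>() {};
final TupleTag<Type2> type2Records = new TupleTag<Type2>() {};
final TupleTag<CSVInvalidLine> invalidRecords = new TupleTag<CSVInvalidLine>() {};

PCollectionTuple results = lines.apply(
    ParDo.of(new DoFn<String, Type1>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            // parse c.element(), then emit to the matching tag:
            // c.output(parsedType1);                 // main output
            // c.output(type2Records, parsedType2);   // additional output
            // c.output(invalidRecords, invalidLine); // invalid lines
          }
        })
        .withOutputTags(type1Records,
            TupleTagList.of(type2Records).and(invalidRecords)));

PCollection<CSVInvalidLine> invalid = results.get(invalidRecords);
```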
Hi Wout,
This is great, thank you. I wrote the partial update support you reference
and I'll be happy to mentor you through your first PR - welcome aboard. Can
you please open a Jira to reference this work and we'll assign it to you?
We discussed having the "_xxx" fields in the document and
Hi Juan,
You are correct that BEAM-2277 seems to be recurring. I have today stumbled
upon that myself in my own pipeline (not word count).
I have just posted a workaround at the bottom of the issue, and will reopen
the issue.
Thank you for reminding us on this,
Tim
On Mon, Aug 20, 2018 at 4:44
I answered on SO Kelsey,
I believe you should be able to add this to explicitly declare the coder to
use:
p.getCoderRegistry()
    .registerCoderForClass(HCatRecord.class,
        WritableCoder.of(DefaultHCatRecord.class));
On Fri, Jul 20, 2018 at 5:05 PM, Kelsey RIDER wrote:
> Hello,
>
>
>
> I wrote
I took a super quick look at the code and I think Romain is correct.
1. In a retry scenario it calls handleRetry()
2. Within handleRetry() it gets the DefaultRetryPredicate and calls
test(response) - this reads the response stream into JSON
3. When the retry is successful (no 429 code) the response
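The underlying issue in step 2 can be shown with plain java.io (a generic sketch, not the IO's actual code): an InputStream is consumed once, so a predicate that parses the body must buffer it for any later reader.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class ReplayBody {
  public static void main(String[] args) throws IOException {
    InputStream response = new ByteArrayInputStream(
        "{\"status\":429}".getBytes(StandardCharsets.UTF_8));

    // The retry predicate consumes the stream while checking the status...
    byte[] body = response.readAllBytes();
    boolean retry = new String(body, StandardCharsets.UTF_8).contains("429");

    // ...so later consumers must be handed a replay of the buffered bytes,
    // not the now-exhausted original stream.
    InputStream replay = new ByteArrayInputStream(body);
    System.out.println("retry=" + retry + " available=" + replay.available());
  }
}
```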
Hi Juan
This sounds reminiscent of https://issues.apache.org/jira/browse/BEAM-2277
which we believed was fixed in 2.7.0.
What IO are you using to write your files and can you paste a snippet of
your code please?
On BEAM-2277 I posted a workaround for AvroIO (it might help you find a
workaround too):
Hi Juan
Well done for diagnosing your issue and thank you for taking the time to
report it here.
I'm not the author of this section but I've taken a quick look at the code
and the inline comments, and I have some observations which I think might
help explain it.
I notice it writes into temporary
To clarify Ismaël's comment:
The Cloudera repo indicates CDH 6.1 will ship Spark 2.4, but CDH is
currently still on 6.0.
https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/spark/spark-core_2.11/2.4.0-cdh6.1.0/
With the HWX / Cloudera merger the release cycle is not announced
This is great. Thanks Pablo and all
I've seen several folk struggle with writing avro to dynamic locations
which I think might be a good addition. If you agree I'll offer a PR unless
someone gets there first - I have an example here:
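For reference, a sketch of dynamic Avro destinations via FileIO.writeDynamic (the destination function, schema, and output path here are illustrative assumptions, not the example linked above):

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.AvroIO;
import org.apache.beam.sdk.io.FileIO;

// Route each record to a destination derived from one of its fields,
// writing a separate set of Avro files per destination.
input.apply(FileIO.<String, GenericRecord>writeDynamic()
    .by(record -> record.get("type").toString())  // assumed "type" field
    .withDestinationCoder(StringUtf8Coder.of())
    .via(AvroIO.sink(schema))
    .to("hdfs://output/avro")                     // placeholder path
    .withNaming(type -> FileIO.Write.defaultNaming("part-" + type, ".avro")));
```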
Another +1 to support your research into this Chad. Thank you.
Trying to understand where a Beam process is in the Spark DAG is... not
easy. A UI that helped would be a great addition.
On Wed, Jun 26, 2019 at 3:30 PM Ismaël Mejía wrote:
> +1 don't hesitate to create a JIRA + PR. You may be
Hi Jordan
I don't know if we qualify as a large Beam project but at GBIF.org we bring
together datasets from 1600+ institutions documenting 1.4B observations of
species (museum data, citizen science, environmental reports etc).
As far as Beam goes though, we aren't using the most advanced
My apologies, I missed the link:
[1] https://github.com/gbif/pipelines
On Tue, Apr 21, 2020 at 5:58 PM Tim Robertson
wrote:
> Hi Jordan
>
> I don't know if we qualify as a large Beam project but at GBIF.org we
> bring together datasets from 1600+ institutions documenting 1.4B