Hello Pablo! Just to clarify -- the Row schemas aren't known at pipeline construction time, but can be discovered from an instance of MyData?
Once discovered, is the schema "homogeneous" for all instances of MyData? (i.e. someRow will always have the same schema for all instances afterwards, and there won't be another someRow with a different schema)

We've encountered a parallel "problem" with pure Avro data, where each instance is a GenericRecord containing its own Avro schema, but *without* the schema being known until the pipeline is run.

The solution we've been using is a bit hacky: an ad hoc per-job schema registry and a custom coder, where each worker saves the schema in the `encode` before writing the record, and loads it lazily in the `decode` before reading. The original code is available [1] (be gentle, it was written with Beam 0.4.0-incubating... and has continued to work until now). In practice, the ad hoc schema registry is just a server socket in the Spark driver, in-memory for DirectRunner / local mode, and a read/write to a known location for other runners.

There are definitely other solutions with side inputs and providers, and the job server in portability looks like an exciting candidate for the per-job schema registry story... I'm super eager to see if there are other ideas or a contribution we can make in this area that's "Beam Row" oriented!

Ryan

[1] https://github.com/Talend/components/blob/master/core/components-adapter-beam/src/main/java/org/talend/components/adapter/beam/coders/LazyAvroCoder.java

On Tue, Jul 23, 2019 at 12:49 AM Pablo Estrada <[email protected]> wrote:
>
> Hello all,
> I am writing a utility to push data to PubSub. My data class looks
> something like so:
> ==========
> class MyData {
>   String someId;
>   Row someRow;
>   Row someOtherRow;
> }
> ==============
> The schema for the Rows is not known a-priori. It is contained by the Row.
> I am then pushing this data to pubsub:
> ===========
> MyData pushingData = ....
> WhatCoder? coder = ....
>
> ByteArrayOutputStream os = new ByteArrayOutputStream();
> coder.encode(this, os);
>
> pubsubClient.connect();
> pubsubClient.push(PubSubMessage.newBuilder().setData(os.toByteArray()).build());
> pubsubClient.close();
> =================
> What's the right coder to use in this case? I don't know if SchemaCoder
> will work, because it seems that it requires the Row's schema a priori. I
> have not been able to make AvroCoder work.
>
> Any tips?
> Best
> -P.
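The lazy-coder trick Ryan describes can be sketched in plain Java, minus the Beam and Avro dependencies. All names below are hypothetical (this is not the LazyAvroCoder API); a ConcurrentHashMap stands in for the ad hoc per-job registry, and a plain String stands in for the schema and the record. The point is only the protocol: `encode` registers the schema and writes a compact id in front of each record, and `decode` reads the id and resolves the schema lazily.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class LazySchemaCoderSketch {

    // Ad hoc per-job registry. In-memory here; in a real job this would be
    // the server socket on the Spark driver, or a known storage location.
    static final Map<String, String> REGISTRY = new ConcurrentHashMap<>();

    // Save the schema under its id, then write only the id plus the payload.
    static void encode(String schemaId, String schema, String record,
                       OutputStream os) throws IOException {
        REGISTRY.putIfAbsent(schemaId, schema);
        DataOutputStream out = new DataOutputStream(os);
        out.writeUTF(schemaId);   // each encoded record carries just the id
        out.writeUTF(record);
        out.flush();
    }

    // Read the id, lazily look up the schema, then read the payload.
    static String decode(InputStream is) throws IOException {
        DataInputStream in = new DataInputStream(is);
        String schemaId = in.readUTF();
        String schema = REGISTRY.get(schemaId);
        if (schema == null) {
            throw new IOException("Schema not yet registered: " + schemaId);
        }
        // A real coder would use `schema` to deserialize the bytes here.
        return in.readUTF();
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        encode("orders-v1", "{\"type\":\"record\"}", "some-payload", os);
        String roundTripped = decode(new ByteArrayInputStream(os.toByteArray()));
        System.out.println(roundTripped);
    }
}
```

The same shape should apply to Pablo's case: the per-record cost is one small id instead of a full schema, at the price of needing some job-scoped place where every worker can resolve ids.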
