You can use WithKeys for that: https://beam.apache.org/documentation/sdks/javadoc/0.6.0/org/apache/beam/sdk/transforms/WithKeys.html
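To make that concrete: a minimal sketch of keying a PCollection of GenericRecords with WithKeys instead of a hand-written ParDo, in the spirit of Flink's key extractor. The `encodeKey` helper and the field names "region" and "id" are assumptions for illustration, not anything from the thread:

```java
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.transforms.SerializableFunction;
import org.apache.beam.sdk.transforms.WithKeys;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

public class KeyByRecordFields {
  // Hypothetical helper: encode several primary-key fields into one string key.
  static String encodeKey(Object... pkFields) {
    StringBuilder sb = new StringBuilder();
    for (Object f : pkFields) {
      if (sb.length() > 0) {
        sb.append('|');
      }
      sb.append(f);
    }
    return sb.toString();
  }

  // WithKeys pairs each element with a key computed by the given
  // SerializableFunction; the element itself is not copied or re-encoded here.
  static PCollection<KV<String, GenericRecord>> keyByFields(
      PCollection<GenericRecord> data) {
    return data.apply(WithKeys.of(
        new SerializableFunction<GenericRecord, String>() {
          @Override
          public String apply(GenericRecord r) {
            // "region" and "id" are assumed field names for illustration.
            return encodeKey(r.get("region"), r.get("id"));
          }
        }));
  }
}
```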
Best,
Aljoscha

> On 22 Mar 2017, at 19:06, Newport, Billy <[email protected]> wrote:
>
> If I'm reading a parquet or avro file though, I don't have a KV<K,Data1>, I
> have a Data. Do I need to run a ParDo just to extract the keys for this to
> work?
>
> PCollection<GenericRecord> data;
> PCollection<KV<String,GenericRecord>> keyedData = "data par do'ed to create
> a KV for each GenericRecord, extracting possibly multiple field PKs encoded
> as a string"
>
> Then do the stuff below. This seems pretty expensive (serialization wise)
> compared with the Flink KeyExtractor, for example, or is it similar in
> practice?
>
> Thanks Thomas.
>
> From: Thomas Groh [mailto:[email protected]]
> Sent: Wednesday, March 22, 2017 1:53 PM
> To: [email protected]
> Subject: Re: Apache Beam cogroup help
>
> This would be implemented via a CoGroupByKey
> (https://beam.apache.org/documentation/sdks/javadoc/0.6.0/org/apache/beam/sdk/transforms/join/CoGroupByKey.html).
>
> Your transform logic will be mostly the same; after applying the extraction
> (the right side of k1 and k2 in your example), you should have two
> PCollections of KVs:
>
> PCollection<KV<K, Data1>> k1;
> PCollection<KV<K, Data2>> k2;
>
> You can construct a KeyedPCollectionTuple containing the two PCollections:
>
> final TupleTag<Data1> data1Tag = new TupleTag<>();
> final TupleTag<Data2> data2Tag = new TupleTag<>();
> KeyedPCollectionTuple<K> coGroupTuple =
>     KeyedPCollectionTuple.of(data1Tag, k1).and(data2Tag, k2);
>
> Then apply the CoGroupByKey:
>
> PCollection<KV<K, CoGbkResult>> coGrouped =
>     coGroupTuple.apply(CoGroupByKey.<K>create());
>
> Then you can run an arbitrary ParDo to combine the elements as appropriate.
> You'll need to reuse the TupleTags created above to extract the
> per-PCollection outputs.
> As a simple example where the elements have a shared supertype CombinedData,
> and you'd like to add them to a single output list:
>
> PCollection<KV<K, List<CombinedData>>> combined = coGrouped.apply(ParDo.of(
>     new DoFn<KV<K, CoGbkResult>, KV<K, List<CombinedData>>>() {
>       @ProcessElement
>       public void process(ProcessContext context) {
>         List<CombinedData> all = new ArrayList<>();
>         for (Data1 d1 : context.element().getValue().getAll(data1Tag)) {
>           all.add(d1);
>         }
>         for (Data2 d2 : context.element().getValue().getAll(data2Tag)) {
>           all.add(d2);
>         }
>         context.output(KV.of(context.element().getKey(), all));
>       }
>     }));
>
> On Wed, Mar 22, 2017 at 10:35 AM, Newport, Billy <[email protected]> wrote:
>
> Trying to port Flink code to Apache Beam, but I'm having trouble decoding
> the documentation.
>
> I have Flink code which looks like:
>
> DataSet<GenericRecord> d1 = ... // read Parquet
> DataSet<GenericRecord> d2 = ... // read Avro
> KeyExtractor<GenericRecord> k1 = ... // extracts an object containing the
>                                      // key fields from d1 records
> KeyExtractor<GenericRecord> k2 = ... // extracts an object containing the
>                                      // key fields from d2 records
>
> CoGroup<GenericRecord,GenericRecord,GenericRecord> grouper = ... // combines
>     // values for equal keys into a combined list for that key
>
> DataSet<GenericRecord> combined =
>     d1.coGroup(d2).where(k1).equalTo(k2).with(grouper);
>
> What's the Beam equivalent?
>
> Thanks
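Putting the replies in this thread together, a hedged end-to-end sketch of the Beam side of the Flink snippet, with string keys and GenericRecord on both inputs. The reads, the `PARQUET_TAG`/`AVRO_TAG` names, and the `concat` helper are placeholders introduced here, not anything from the thread:

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.join.CoGbkResult;
import org.apache.beam.sdk.transforms.join.CoGroupByKey;
import org.apache.beam.sdk.transforms.join.KeyedPCollectionTuple;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TupleTag;

public class CoGroupEquivalent {
  // One TupleTag per input; reused in the DoFn to pull out each side's values.
  static final TupleTag<GenericRecord> PARQUET_TAG = new TupleTag<>();
  static final TupleTag<GenericRecord> AVRO_TAG = new TupleTag<>();

  // Pure helper: append both iterables into one list, in order.
  static <T> List<T> concat(Iterable<T> a, Iterable<T> b) {
    List<T> out = new ArrayList<>();
    for (T t : a) {
      out.add(t);
    }
    for (T t : b) {
      out.add(t);
    }
    return out;
  }

  // Beam equivalent of d1.coGroup(d2).where(k1).equalTo(k2).with(grouper):
  // the inputs are assumed already keyed (e.g. via WithKeys or a ParDo).
  static PCollection<KV<String, List<GenericRecord>>> coGroup(
      PCollection<KV<String, GenericRecord>> keyedParquet,
      PCollection<KV<String, GenericRecord>> keyedAvro) {
    return KeyedPCollectionTuple
        .of(PARQUET_TAG, keyedParquet)
        .and(AVRO_TAG, keyedAvro)
        .apply(CoGroupByKey.<String>create())
        .apply(ParDo.of(
            new DoFn<KV<String, CoGbkResult>, KV<String, List<GenericRecord>>>() {
              @ProcessElement
              public void process(ProcessContext c) {
                CoGbkResult result = c.element().getValue();
                c.output(KV.of(
                    c.element().getKey(),
                    concat(result.getAll(PARQUET_TAG), result.getAll(AVRO_TAG))));
              }
            }));
  }
}
```

The combine step plays the role of Flink's CoGroup function: for each key it sees both sides' values and emits one merged record per key.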
