Beam at DataWorks Summit Europe 2017

2017-03-28 Thread Davor Bonaci
I'm happy to share that Apache Beam will be featured next week at the
DataWorks Summit Europe 2017 in Munich, Germany [1].

Session: Unified, efficient and portable data processing with Apache Beam
[2]
Time: Wednesday, April 5, 2017, 5:50 PM

If you'll be attending the conference and would like to talk about
all-things-Beam, please reach out!

Thanks,
Davor

[1] https://dataworkssummit.com/munich-2017/
[2]
https://dataworkssummit.com/munich-2017/sessions/unified-efficient-and-portable-data-processing-with-apache-beam/


Re: Apache Beam cogroup help

2017-03-28 Thread Dan Halperin
Also, as for expense -- it shouldn't be expensive: on almost every runner
(Flink included) such a ParDo will usually be run on the same machine,
without any serialization.
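
For reference, a minimal sketch of the key-extraction step being discussed,
assuming GenericRecord elements and a made-up field name "id"; WithKeys
(linked below) wraps the same logic as a hand-rolled ParDo and benefits from
the same fusion described above:

import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.transforms.SerializableFunction;
import org.apache.beam.sdk.transforms.WithKeys;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

PCollection<GenericRecord> data = ...;  // read from Avro/Parquet
PCollection<KV<String, GenericRecord>> keyedData =
    data.apply("ExtractKey",
        WithKeys.of(new SerializableFunction<GenericRecord, String>() {
          @Override
          public String apply(GenericRecord record) {
            // Concatenate the primary-key field(s) into a single string key;
            // "id" is only a placeholder field name.
            return String.valueOf(record.get("id"));
          }
        }));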

On Wed, Mar 22, 2017 at 12:25 PM, Aljoscha Krettek 
wrote:

> You can use WithKeys for that:
> https://beam.apache.org/documentation/sdks/javadoc/0.6.0/org/apache/beam/sdk/transforms/WithKeys.html
>
> Best,
> Aljoscha
>
> > On 22 Mar 2017, at 19:06, Newport, Billy  wrote:
> >
> > If I’m reading a Parquet or Avro file though, I don’t have a
> > KV, I have a GenericRecord. Do I need to run a ParDo just to extract the
> > keys for this to work?
> >
> > PCollection<GenericRecord> data;
> > PCollection<KV<String, GenericRecord>> keyedData = “data ParDo’ed to
> > create a KV<String, GenericRecord> for each GenericRecord, extracting
> > possibly multiple field PKs encoded as a string”
> >
> > Then do the stuff below. This seems pretty expensive (serialization-wise)
> > compared with the Flink KeyExtractor, for example, or is it similar in
> > practice?
> >
> > Thanks Thomas.
> >
> > From: Thomas Groh [mailto:tg...@google.com]
> > Sent: Wednesday, March 22, 2017 1:53 PM
> > To: user@beam.apache.org
> > Subject: Re: Apache Beam cogroup help
> >
> > This would be implemented via a CoGroupByKey
> > (https://beam.apache.org/documentation/sdks/javadoc/0.6.0/org/apache/beam/sdk/transforms/join/CoGroupByKey.html)
> >
> > Your transform logic will be mostly the same; after applying the
> extraction (the right side of k1 and k2 in your example), you should have
> two PCollections of KVs -
> >
> > PCollection<KV<String, Data1>> k1;
> > PCollection<KV<String, Data2>> k2;
> >
> > You can construct a KeyedPCollectionTuple containing the two
> PCollections:
> >
> > final TupleTag<Data1> data1Tag = new TupleTag<>();
> > final TupleTag<Data2> data2Tag = new TupleTag<>();
> > KeyedPCollectionTuple<String> coGroupTuple =
> >     KeyedPCollectionTuple.of(data1Tag, k1).and(data2Tag, k2);
> >
> > Then apply the CoGroupByKey:
> >
> > PCollection<KV<String, CoGbkResult>> coGrouped =
> >     coGroupTuple.apply(CoGroupByKey.<String>create());
> >
> > Then you can run an arbitrary ParDo to combine the elements as
> appropriate. You'll need to reuse the TupleTags created above to extract
> out the per-PCollection outputs. As a simple example where the elements
> have a shared supertype CombinedData, and you'd like to add them to a
> single output list:
> >
> > PCollection<KV<String, List<CombinedData>>> combined =
> >     coGrouped.apply(ParDo.of(
> >         new DoFn<KV<String, CoGbkResult>, KV<String, List<CombinedData>>>() {
> >       @ProcessElement
> >       public void process(ProcessContext context) {
> >         List<CombinedData> all = new ArrayList<>();
> >         for (Data1 d1 : context.element().getValue().getAll(data1Tag)) {
> >           all.add(d1);
> >         }
> >         for (Data2 d2 : context.element().getValue().getAll(data2Tag)) {
> >           all.add(d2);
> >         }
> >         context.output(KV.of(context.element().getKey(), all));
> >       }
> >     }));
> >
> > On Wed, Mar 22, 2017 at 10:35 AM, Newport, Billy 
> wrote:
> > Trying to port flink code to Apache Beam but I’m having trouble decoding
> the documentation.
> >
> > I have flink code which looks like:
> >
> > DataSet<GenericRecord> d1 = Read Parquet
> > DataSet<GenericRecord> d2 = Read Avro
> > KeyExtractor k1 = … (extracts an object containing the key fields from d1 records)
> > KeyExtractor k2 = … (extracts an object containing the key fields from d2 records)
> >
> > CoGroup grouper = … (combines values for equal keys into a combined list for that key)
> >
> > DataSet combined = d1.coGroup(d2).where(k1).equalTo(k2).with(grouper)
> >
> > What’s the Beam equivalent?
> >
> > Thanks
>
>


Re: Beam partitioned file reading and writing

2017-03-28 Thread Robert Bradshaw
I would simply use a Partition transform to convert the
PCollection<KV<Enum, GenericRecord>> data to a list of
PCollection<GenericRecord>s and then write each one as desired. Either that
or a DoFn with multiple side outputs (one for each enum value). There's no
reason this should perform poorly on the runners I am aware of, and it is
the most straightforward (and least error-prone) solution. (I don't know
what your original solution looked like, but if you used N filters rather
than one DoFn with N outputs, that would perform much worse.)

https://beam.apache.org/documentation/programming-guide/#transforms-flatten-partition
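
A rough sketch of the Partition approach, assuming a hypothetical MyEnum key
type and leaving the actual Avro write to whichever sink you already use:

import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.transforms.Partition;
import org.apache.beam.sdk.transforms.Values;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;

PCollection<KV<MyEnum, GenericRecord>> data = ...;  // MyEnum stands in for your enum

// One partition per enum value, routed by the enum's ordinal.
PCollectionList<KV<MyEnum, GenericRecord>> parts = data.apply(
    Partition.of(MyEnum.values().length,
        new Partition.PartitionFn<KV<MyEnum, GenericRecord>>() {
          @Override
          public int partitionFor(KV<MyEnum, GenericRecord> elem, int numPartitions) {
            return elem.getKey().ordinal();
          }
        }));

for (MyEnum e : MyEnum.values()) {
  // Drop the enum key, leaving just the records for this partition.
  PCollection<GenericRecord> records =
      parts.get(e.ordinal()).apply("Drop" + e, Values.<GenericRecord>create());
  // apply the Avro write for this partition's output path here
}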

On Tue, Mar 28, 2017 at 1:04 AM, Narek Amirbekian 
wrote:

> We had to implement a custom sink in order to get partitioned destinations.
> This is the least hacky solution, but the down-sides of this approach are
> that it is a lot of code and that custom sinks only work in batch mode. If
> you need to do this in streaming mode you may be able to get away with
> writing as a side effect of a ParDo, but you have to account for concurrent
> writes, and chunks might be processed more than once, so you have to
> account for duplicate values in your output.
>
> On Thu, Mar 23, 2017 at 9:46 AM Newport, Billy 
> wrote:
>
>> Is there built-in support for writing partitioned collections? For example:
>>
>>
>>
>> PCollection<KV<Enum, GenericRecord>> data;
>>
>>
>>
>> We want to write the GenericRecords in data into different files based
>> on the enum. We did this in Flink by making a proxy HadoopOutputFormat
>> which has the N real output formats; its write method checks the enum and
>> forwards the write call for the GenericRecord to the correct output format.
>>
>>
>>
>> Given the lack of Beam Parquet support, the final writer we want to use
>> with Beam is Avro.
>>
>>
>>
>> We used the proxy output format trick in Flink because performance was
>> very poor using a filter to split it and then a map to convert from
>> (Enum, GenericRecord) pairs to just GenericRecord.
>>
>>
>>
>> I’m nervous to use side outputs in Beam given I think they will be
>> implemented as described here, which performs poorly.
>>
>>
>>
>> So basically, has anyone implemented a demuxing AvroIO.Writer?
>>
>>
>>
>> Thanks
>>
>
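
A rough sketch of the "DoFn with multiple side outputs" option mentioned
above, assuming a hypothetical MyEnum with made-up values EXISTING and
INSERT; each tagged output can then get its own Avro write. (In Beam 0.6.0
the extra-output call is sideOutput; later SDKs rename it to output(tag, ...).)

import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.values.TupleTagList;

// One tag per enum value; only two shown here for brevity.
final TupleTag<GenericRecord> existingTag = new TupleTag<GenericRecord>() {};
final TupleTag<GenericRecord> insertTag = new TupleTag<GenericRecord>() {};

PCollection<KV<MyEnum, GenericRecord>> data = ...;

PCollectionTuple demuxed = data.apply(
    ParDo.of(new DoFn<KV<MyEnum, GenericRecord>, GenericRecord>() {
      @ProcessElement
      public void process(ProcessContext c) {
        // Route each record to the output that matches its enum key.
        if (c.element().getKey() == MyEnum.EXISTING) {
          c.output(c.element().getValue());                 // main output
        } else {
          c.sideOutput(insertTag, c.element().getValue());  // c.output(insertTag, ...) in later SDKs
        }
      }
    }).withOutputTags(existingTag, TupleTagList.of(insertTag)));

// demuxed.get(existingTag) and demuxed.get(insertTag) can each be written
// with their own Avro sink, so the data is only traversed once.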


Re: Beam partitioned file reading and writing

2017-03-28 Thread Narek Amirbekian
We had to implement a custom sink in order to get partitioned destinations.
This is the least hacky solution, but the down-sides of this approach are
that it is a lot of code and that custom sinks only work in batch mode. If
you need to do this in streaming mode you may be able to get away with
writing as a side effect of a ParDo, but you have to account for concurrent
writes, and chunks might be processed more than once, so you have to
account for duplicate values in your output.

On Thu, Mar 23, 2017 at 9:46 AM Newport, Billy  wrote:

> Is there built-in support for writing partitioned collections? For example:
>
>
>
> PCollection<KV<Enum, GenericRecord>> data;
>
>
>
> We want to write the GenericRecords in data into different files based on
> the enum. We did this in Flink by making a proxy HadoopOutputFormat which
> has the N real output formats; its write method checks the enum and
> forwards the write call for the GenericRecord to the correct output format.
>
>
>
> Given the lack of Beam Parquet support, the final writer we want to use
> with Beam is Avro.
>
>
>
> We used the proxy output format trick in Flink because performance was very
> poor using a filter to split it and then a map to convert from
> (Enum, GenericRecord) pairs to just GenericRecord.
>
>
>
> I’m nervous to use side outputs in Beam given I think they will be
> implemented as described here, which performs poorly.
>
>
>
> So basically, has anyone implemented a demuxing AvroIO.Writer?
>
>
>
> Thanks
>