[ANNOUNCE] Beam 2.27.0 Released

2021-01-08 Thread Pablo Estrada
The Apache Beam team is pleased to announce the release of version 2.27.0. Apache Beam is an open source unified programming model to define and execute data processing pipelines, including ETL, batch and stream (continuous) processing. See https://beam.apache.org You can download the release her

Re: Quick question regarding ParquetIO

2021-01-08 Thread Kobe Feng
Tao, I'm not an expert, and good intuition, all you want is schema awareness transformations or let's say schema based transformation in Beam not only for IO but also for other DoFn, etc, and possibly have schema revolution in future as well. This is how I try to understand and explain in other pl

Re: Quick question regarding ParquetIO

2021-01-08 Thread Tao Li
Thanks Alexey for your explanation. That’s also what I was thinking. Parquet files already have the schema built in, so it might be feasible to infer a coder automatically (like spark parquet reader). It would be great if we have some experts chime in here. @Brian Hulette

Side Inputs or CoGroupByKey

2021-01-08 Thread Gruß , Hendrik
Hi everybody, I would like to hear your thoughts on which technique would be used in Apache Beam for the following problem: *Problem definition*: I have two streams of data, one with pageviews of users, and another with requests of the users. They share the key session_id which describes the use

Re: Quick question regarding ParquetIO

2021-01-08 Thread Alexey Romanenko
Well, this is how I see it, let me explain. Since every PCollection is required to have a Coder to materialize the intermediate data, we need to have a coder for "PCollection" as well. If I’m not mistaken, for “GenericRecord" we used to set AvroCoder that is based on Avro (or Beam too?) schema

Re: Side input in streaming

2021-01-08 Thread Manninger, Matyas
Dear Kenn, Thanks again, that pattern was my initial plan but there seems to be a bug in the python API in the periodicsequence.py on line 42 "total_outputs = math.ceil((end - start) / interval)". Here end start and interval are all Durations and the / operator is not defined for the Duration clas