Re: [ANNOUNCE] Beam 2.19.0 Released

2020-02-04 Thread Hannah Jiang
Thanks Boyuan! On Tue, Feb 4, 2020 at 4:46 PM Connell O'Callaghan wrote: > Well done and thank you Boyuan (and all involved)!!! > > On Tue, Feb 4, 2020 at 4:25 PM Boyuan Zhang wrote: > >> The Apache Beam team is pleased to announce the release of version 2.19.0 >> . >> >> Apache Beam is an ope

Re: [ANNOUNCE] Beam 2.19.0 Released

2020-02-04 Thread Connell O'Callaghan
Well done and thank you Boyuan (and all involved)!!! On Tue, Feb 4, 2020 at 4:25 PM Boyuan Zhang wrote: > The Apache Beam team is pleased to announce the release of version 2.19.0. > > Apache Beam is an open source unified programming model to define and > execute data processing pipelines, incl

[ANNOUNCE] Beam 2.19.0 Released

2020-02-04 Thread Boyuan Zhang
The Apache Beam team is pleased to announce the release of version 2.19.0. Apache Beam is an open source unified programming model to define and execute data processing pipelines, including ETL, batch and stream (continuous) processing. See https://beam.apache.org You can download the release her

Dropping expired sessions with Apache Beam

2020-02-04 Thread Juliana Pereira
I have a log web log file that contains sessions id's and interactions, there are three interactions `GET, LOGIN, LOGOUT`. Something like: ``` 00:00:01;session1;GET 00:00:03;session2;LOGIN 00:01:01;session1;LOGOUT 00:03:01;session2;GET 00:08:15;session2;GET ``` and goes on. I want to be able to

Seattle Beam Meetup - March 2

2020-02-04 Thread Aizhamal Nurmamat kyzy
Hello everyone, We are hosting a Beam Meetup in Seattle on March 2! If you are in the Seattle area please come and join us at Google office in South Lake Union. Meetup agenda: 18:00 - Registration, speed networking, food and drinks. 18:30 - Encoding free-text drug names in electronic health recor

Re: dataflow job was working fine last night and it isn't now

2020-02-04 Thread Alan Krumholz
Seems like the image we use in KFP to orchestrate the job has cloudpickle==0.8.1 and that one doesn't seem to cause issues. I think I'm unblock for now but I'm sure I won't be the last one to try to do this using GCP managed notebooks :( Thanks for all the help! On Tue, Feb 4, 2020 at 12:24 PM A

Re: dataflow job was working fine last night and it isn't now

2020-02-04 Thread Alan Krumholz
I'm using a managed notebook instance from GCP It seems those already come with cloudpickle==1.2.2 as soon as you provision it. apache-beam[gcp] will then install dill==0.3.1.1 I'm going to try to uninstall cloudpickle before installing apache-beam and see if this fixes the problem Thank you On T

Re: dataflow job was working fine last night and it isn't now

2020-02-04 Thread Valentyn Tymofieiev
The fact that you have cloudpickle==1.2.2 further confirms that you may be hitting the same error as https://stackoverflow.com/questions/42960637/python-3-5-dill-pickling-unpickling-on-different-servers-keyerror-classtype . Could you try to start over with a clean virtual environment? On Tue, Fe

Re: dataflow job was working fine last night and it isn't now

2020-02-04 Thread Alan Krumholz
Hi Valentyn, Here is my pip freeze on my machine (note that the error is in dataflow, the job runs fine in my machine) ansiwrap==0.8.4 apache-beam==2.19.0 arrow==0.15.5 asn1crypto==1.3.0 astroid==2.3.3 astropy==3.2.3 attrs==19.3.0 avro-python3==1.9.1 azure-common==1.1.24 azure-storage-blob==2.1.0

Re: dataflow job was working fine last night and it isn't now

2020-02-04 Thread Valentyn Tymofieiev
It don't think there is a mismatch between dill versions here, but https://stackoverflow.com/questions/42960637/python-3-5-dill-pickling-unpickling-on-different-servers-keyerror-classtype mentions a similar error and may be related. What is the output of pip freeze on your machine (or better: pip i

Re: dataflow job was working fine last night and it isn't now

2020-02-04 Thread Alan Krumholz
BTW it doesn't seem to be related to the BQ sink. My job is failing now too without that part (and it wasn't earlier today): def test_error( bq_table: str) -> str: import apache_beam as beam from apache_beam.options.pipeline_options import PipelineOptions class GenData(beam.DoFn)

Re: dataflow job was working fine last night and it isn't now

2020-02-04 Thread Alan Krumholz
Here is a test job that sometimes fails and sometimes doesn't (but most times do). There seems to be something stochastic that causes this as after several tests a couple of them did succeed def test_error( bq_table: str) -> str: import apache_beam as beam from apache_beam.op

Re: Kafka Avro Schema Registry Support

2020-02-04 Thread rahul patwari
Thanks Ismael for the update. Thanks Alexey for the enhancement. We will test it with 2.20 release. On Tue, 4 Feb 2020, 10:53 pm Ismaël Mejía, wrote: > Support for Confluent Schema Registry was merged into KafkaIO today. You > can > test it with tomorrow's snapshots (version 2.20.0-SNAPSHOT) or

Re: dataflow job was working fine last night and it isn't now

2020-02-04 Thread Mikhail Gryzykhin
Hi Alan, +Valentyn Tymofieiev Can you verify if my assumption is correct? It seems that the problem might come from dill version mismatch. Dill version should match on worker and user code. Between Beam 2.17 and Beam 2.18 we upgraded dill version to 0.3.1.1 which has an incompatible format with

Re: dataflow job was working fine last night and it isn't now

2020-02-04 Thread Alan Krumholz
I tried breaking apart my pipeline. Seems the step that breaks it is: beam.io.WriteToBigQuery Let me see if I can create a self contained example that breaks to share with you Thanks! On Tue, Feb 4, 2020 at 9:53 AM Pablo Estrada wrote: > Hm that's odd. No changes to the pipeline? Are you able

Re: dataflow job was working fine last night and it isn't now

2020-02-04 Thread Pablo Estrada
Hm that's odd. No changes to the pipeline? Are you able to share some of the code? +Udi Meiri do you have any idea what could be going on here? On Tue, Feb 4, 2020 at 9:25 AM Alan Krumholz wrote: > Hi Pablo, > This is strange... it doesn't seem to be the last beam release as last > night it wa

Re: dataflow job was working fine last night and it isn't now

2020-02-04 Thread Alan Krumholz
Hi Pablo, This is strange... it doesn't seem to be the last beam release as last night it was already using 2.19.0 I wonder if it was some release from the DataFlow team (not beam related): Job typeBatch Job status Succeeded SDK version Apache Beam Python 3.5 SDK 2.19.0 Region us-central1 Start tim

Re: Kafka Avro Schema Registry Support

2020-02-04 Thread Ismaël Mejía
Support for Confluent Schema Registry was merged into KafkaIO today. You can test it with tomorrow's snapshots (version 2.20.0-SNAPSHOT) or just when 2.20.0 gets released. Notice that this was already possible, but Alexey took care of making this more user friendly because this is (was) a frequentl

Re: dataflow job was working fine last night and it isn't now

2020-02-04 Thread Pablo Estrada
Hi Alan, could it be that you're picking up the new Apache Beam 2.19.0 release? Could you try depending on beam 2.18.0 to see if the issue surfaces when using the new release? If something was working and no longer works, it sounds like a bug. This may have to do with how we pickle (dill / cloudpi

Re: Best approach for recalculating statistics based on amended or deleted events?

2020-02-04 Thread Stephen Young
Thank you Reza. That was very helpful! On 2020/02/03 01:03:18, Reza Rokni wrote: > Hi, > > So https://issues.apache.org/jira/browse/BEAM-91 would be nice... but not > there sadly. > > Not ideal but some thoughts: > > With regards to state, the time scale ( for example you mentioned a week ) >

dataflow job was working fine last night and it isn't now

2020-02-04 Thread Alan Krumholz
Hi, I was running a dataflow job in GCP last night and it was running fine. This morning this same exact job is failing with the following error: Error message from worker: Traceback (most recent call last): File "/usr/local/lib/python3.5/site-packages/apache_beam/internal/pickler.py", line 286,