Re: Definition of Unified model

2019-05-23 Thread Jan Lukavský
Hi all, thanks everyone for this discussion. I think I have gathered enough feedback to be able to put down a proposition for changes, which I will do and send to this list for further discussion. There are still doubts remaining the non-determinism and it's relation to outputs stability vs.

Re: @RequireTimeSortedInput design draft

2019-05-23 Thread Robert Bradshaw
Thanks for writing this up. I think the justification for adding this to the model needs to be that it is useful (you have this covered, though some examples would be nice) and that it's something that can't easily be done by users themselves (specifically, though it can be (relatively) cheaply

Re: Definition of Unified model

2019-05-23 Thread Robert Bradshaw
Good point. The "implementation-specific" way I would do this is window-by-instant, followed by a DoFn that gets all the elements with the same timestamp and sorts/acts accordingly, but this counts on the runner producing windows in timestamp order (likely?) and also the subsequent DoFn getting

@RequireTimeSortedInput design draft

2019-05-23 Thread Jan Lukavský
Hi, I have written a very brief draft of how it might be possible to implement @RequireTimeSortedInput discussed in [1]. I see the document [2] a starting point for a discussion. There are several open questions, which I believe can be resolved by this great community. :-) Jan [1]

Re: Environments for External Transforms

2019-05-23 Thread Thomas Weise
On Thu, May 23, 2019 at 3:46 AM Maximilian Michels wrote: > > Writing a new transform involves updating the expansion service to > include their new transform. > > Would it be conceivable that the expansion is performed via the > environment? That would solve the problem of updating the

Beam Summit Europe: speakers and schedule online!

2019-05-23 Thread Matthias Baetens
Hi everyone, Happy to share that the speakers and schedule are now online on the website. Make sure you register on Eventbrite if you want to attend and follow out Twitter channel

Re: Definition of Unified model

2019-05-23 Thread Reuven Lax
So Jan's example of state machines is quite a valid use case for ordering. However in my experience, timestamp ordering is insufficient for state machines. Elements that cause state transitions might come in with the exact same timestamp, yet still have a necessary ordering. Especially given

Re: Definition of Unified model

2019-05-23 Thread Reuven Lax
I'm simply saying that timestamp ordering is insufficient for state machines. I wasn't proposing Kafka as a solution - that was simply an example of how people solve this problem in other scenarios. BTW another example of ordering: Imagine today that you have a triggered Sum aggregation writing

DISCUSS: Sorted MapState API

2019-05-23 Thread Reuven Lax
Beam's state API is intended to be useful as an alternative aggregation method. Stateful ParDos can aggregate elements into state and set timers to read the state. Currently Beam's state API supports two ways of storing collections of elements: BagState and MapState. BagState is the one most

Re: Jira component for HDFS issues with Python SDK

2019-05-23 Thread Chamikara Jayalath
On Thu, May 23, 2019 at 10:28 AM Pablo Estrada wrote: > I've added the io-python-hadoop component [1]. We can continue to discuss > naming conventions for components[2]. > > The way Java is set up, we have: > sdk-java-* > io-java-* > > For Python, we have: > sdk-py-* > io-python-* > > We could

Re: Jira component for HDFS issues with Python SDK

2019-05-23 Thread Chamikara Jayalath
On Thu, May 23, 2019 at 9:20 AM Valentyn Tymofieiev wrote: > Hi, > > Could someone please help with addition io-python-hadoop or similar > component to Jira? > > Also, there is a small discrepancy in naming py vs python between: > > io-python-gcp and sdk-py-core - consider unifying them. > +1

Re: Definition of Unified model

2019-05-23 Thread Reuven Lax
So an example would be elements of type "startUserSession" and "endUserSession" (website sessions, not Beam sessions). Logically you may need to process them in the correct order if you have any sort of state-machine logic. However timestamp ordering is never guaranteed to match the logical

Jira component for HDFS issues with Python SDK

2019-05-23 Thread Valentyn Tymofieiev
Hi, Could someone please help with addition io-python-hadoop or similar component to Jira? Also, there is a small discrepancy in naming py vs python between: io-python-gcp and sdk-py-core - consider unifying them. Thank you!

Re: DISCUSS: Sorted MapState API

2019-05-23 Thread Lukasz Cwik
I would suggest that we drop MapState and instead support MultimapState without ordering as a first pass and potentially add ordering later. Inside an SDK we would be able to build a "sorted" multimap state that uses a prefix of the key and implement radix based sorting. Any time a prefix has too

Re: Jira component for HDFS issues with Python SDK

2019-05-23 Thread Pablo Estrada
> > My main point is that there should be a limited number of top level > guessable prefixes. Looks like main prefixes today are runners-*, sdk-*, > and io-*. As long as this list small it should be fine (which is the case > today). Also, probably we document this structure somewhere. JIRA

Re: Definition of Unified model

2019-05-23 Thread Reuven Lax
Not really. I'm suggesting that some variant of FIFO ordering is necessary, which requires either runners natively support FIFO ordering or transforms adding some extra sequence number to each record to sort by. I still think your proposal is very useful by the way. I'm merely pointing out that

Re: DISCUSS: Sorted MapState API

2019-05-23 Thread Rui Wang
> > A few obvious problems with this code: > 1. Removing the elements already processed from the bag requires > clearing and rewriting the entire bag. This is O(n^2) in the number of > input trades. > why it's not O(2 * n) to clearing and rewriting trade state? > public interface

Re: Beam at Google Summer of Code 2019

2019-05-23 Thread Tanay Tummalapalli
Hi everyone, I made a Kanban board[1] on Github, on my fork of apache/beam to keep track of progress for GSoC '19. Regards, Tanay Tummalapalli [1] https://github.com/ttanay/beam/projects/1 On Tue, May 7, 2019 at 6:39 PM Tanay Tummalapalli wrote: > Thank You! > > I'm really excited to work on

Re: Definition of Unified model

2019-05-23 Thread Jan Lukavský
Hi Reuven, I share the view point of Robert. I think the isuue you refer to is not in reality related to timestamps, but to the fact, that ordering of events in time is observer dependent (either caused by relativity, or time skew, essentially this has the same consequences). And the resolution

Re: Definition of Unified model

2019-05-23 Thread Jan Lukavský
Hi, yes. It seems that ordering by user supplied UDF makes sense and I will update the design proposal accordingly. Would that solve the issues you mention? Jan -- Původní e-mail -- Od: Reuven Lax Komu: dev Datum: 23. 5. 2019 18:44:38 Předmět: Re: Definition of Unified

Re: Hazelcast Jet Runner

2019-05-23 Thread Ismaël Mejía
I saw that the runner was merged but I don’t get why the foler is called ‘runners/jet experimental’ and not simply ‘runners/jet’. Is it because the runner does not pass ValidatesRunner? Or because the contributors are few? I don’t really see any reason behind this suffix. And even if the status is

Re: Beam Summit Europe: speakers and schedule online!

2019-05-23 Thread Kenneth Knowles
Nice! What a great spread of topics. This should be an amazing event. Kenn On Thu, May 23, 2019 at 1:58 PM Joana Filipa Bernardo Carrasqueira < joanafil...@google.com> wrote: > Hi all! > > Looking forward to the conversations about Beam and to meet new people in > the community! > > Please help

Quota: In use IP-adresses

2019-05-23 Thread Mikhail Gryzykhin
Hello everybody, Some of our jobs fail with 1/0 in use IP-addresses quota exception. Seems that we spin-up too many VMs and run out of IP-addresses. Should we bump the quota to mitigate the issue? Regards, Mikhail. --- https://issues.apache.org/jira/browse/BEAM-7410

Re: Beam Summit Europe: speakers and schedule online!

2019-05-23 Thread Joana Filipa Bernardo Carrasqueira
Hi all! Looking forward to the conversations about Beam and to meet new people in the community! Please help us spreading the word about the Beam Summit within your networks and register for the event here . See you all

Re: DISCUSS: Sorted MapState API

2019-05-23 Thread Lukasz Cwik
On Thu, May 23, 2019 at 1:53 PM Ahmet Altay wrote: > > > On Thu, May 23, 2019 at 1:38 PM Lukasz Cwik wrote: > >> >> >> On Thu, May 23, 2019 at 11:37 AM Rui Wang wrote: >> >>> A few obvious problems with this code: 1. Removing the elements already processed from the bag requires

Re: DISCUSS: Sorted MapState API

2019-05-23 Thread Lukasz Cwik
On Thu, May 23, 2019 at 11:37 AM Rui Wang wrote: > A few obvious problems with this code: >> 1. Removing the elements already processed from the bag requires >> clearing and rewriting the entire bag. This is O(n^2) in the number of >> input trades. >> > why it's not O(2 * n) to clearing and

Re: Dataflow runner with Apache Beam 2.12

2019-05-23 Thread pasquale . bonito
Hi Lukasz, there was no error, simply the pipeline was not processing data. However I figured out the error: I was missing dependency from Google IO. Thanks On 2019/05/22 19:24:56, Lukasz Cwik wrote: > Have you tried following the troubleshooting your pipeline guide[1]? > Have you tried to

Re: DISCUSS: Sorted MapState API

2019-05-23 Thread Ahmet Altay
On Thu, May 23, 2019 at 1:38 PM Lukasz Cwik wrote: > > > On Thu, May 23, 2019 at 11:37 AM Rui Wang wrote: > >> A few obvious problems with this code: >>> 1. Removing the elements already processed from the bag requires >>> clearing and rewriting the entire bag. This is O(n^2) in the number of

Re: DISCUSS: Sorted MapState API

2019-05-23 Thread Reza Rokni
+100 As someone who has just spent a lot of time coding all the "GC, caching of BagState fetches, etc" this would make life a lot easier! Its also generally valuable for a lot of timeseries work. On Fri, 24 May 2019 at 04:59, Lukasz Cwik wrote: > > > On Thu, May 23, 2019 at 1:53 PM Ahmet Altay

Re: Jira component for HDFS issues with Python SDK

2019-05-23 Thread Valentyn Tymofieiev
On Thu, May 23, 2019 at 11:23 AM Pablo Estrada wrote: > My main point is that there should be a limited number of top level >> guessable prefixes. Looks like main prefixes today are runners-*, sdk-*, >> and io-*. As long as this list small it should be fine (which is the case >> today). Also,

Re: Custom Watermark Instance being created multiple times for KafkaIO

2019-05-23 Thread Lukasz Cwik
The watermark should be checkpointed along with partition offsets. You will have one watermark class instance for each bundle that is processing. You will have one bundle processed per checkpoint and also one bundle per split (so that Kafka can be read in parallel by multiple workers) and also one

Re: Shuffling on apache beam

2019-05-23 Thread Reuven Lax
Can you explain what you mean by worker? While every runner has workers of course, workers are not part of the programming model. On Thu, May 23, 2019 at 8:13 PM pasquale.bon...@gmail.com < pasquale.bon...@gmail.com> wrote: > Hi all, > I would like to know if Apache Beam has a functionality

Re: DISCUSS: Sorted MapState API

2019-05-23 Thread Reuven Lax
On Thu, May 23, 2019 at 1:53 PM Ahmet Altay wrote: > > > On Thu, May 23, 2019 at 1:38 PM Lukasz Cwik wrote: > >> >> >> On Thu, May 23, 2019 at 11:37 AM Rui Wang wrote: >> >>> A few obvious problems with this code: 1. Removing the elements already processed from the bag requires

Re: DISCUSS: Sorted MapState API

2019-05-23 Thread Reuven Lax
On Thu, May 23, 2019 at 11:23 AM Lukasz Cwik wrote: > I would suggest that we drop MapState and instead support MultimapState > without ordering as a first pass and potentially add ordering later. > While I agree that MultimapState is useful on its own, for these aggregation use cases ordering

Re: Dataflow runner with Apache Beam 2.12

2019-05-23 Thread pasquale . bonito
Hi Lukasz, there was no error displayed, but I got the problem. I was missing dependencies from google cloud io. Thanks On 2019/05/22 19:24:56, Lukasz Cwik wrote: > Have you tried following the troubleshooting your pipeline guide[1]? > Have you tried to reach out to Google Cloud support with

Re: Dataflow runner with Apache Beam 2.12

2019-05-23 Thread pasquale . bonito
I was missing dependency from beam-sdks-java-io-google-cloud-platform. Quite strange Dataflow didn't show any errors. On 2019/05/22 19:24:56, Lukasz Cwik wrote: > Have you tried following the troubleshooting your pipeline guide[1]? > Have you tried to reach out to Google Cloud support with an

Shuffling on apache beam

2019-05-23 Thread pasquale . bonito
Hi all, I would like to know if Apache Beam has a functionality similar to fieldsGrouping in Storm that allows to send records to a specific task/worker based on a key. I know that we can achieve that with a combine/grouByKey operation but that implies to add a windowing in our pipeline that we

Re: Environments for External Transforms

2019-05-23 Thread Maximilian Michels
Writing a new transform involves updating the expansion service to include their new transform. Would it be conceivable that the expansion is performed via the environment? That would solve the problem of updating the expansion service, although it adds additional complexity for bringing up

Re: Environments for External Transforms

2019-05-23 Thread Robert Bradshaw
On Thu, May 23, 2019 at 12:46 PM Maximilian Michels wrote: > > Writing a new transform involves updating the expansion service to > include their new transform. > > Would it be conceivable that the expansion is performed via the > environment? That would solve the problem of updating the

Re: Environments for External Transforms

2019-05-23 Thread Maximilian Michels
My motivation was to get rid of the Docker dependency for the Python VR tests. Similarly to how we use Python's LOOPBACK environment for executing all non-cross-language tests, I wanted to use Java's EMBEDDED environment to run the cross-language transforms. I suppose we could also go with an

Re: Environments for External Transforms

2019-05-23 Thread Robert Bradshaw
On Thu, May 23, 2019 at 11:07 AM Maximilian Michels wrote: > My motivation was to get rid of the Docker dependency for the Python VR > tests. Similarly to how we use Python's LOOPBACK environment for > executing all non-cross-language tests, I wanted to use Java's EMBEDDED > environment to run

Re: Environments for External Transforms

2019-05-23 Thread Robert Bradshaw
On Wed, May 22, 2019 at 6:17 PM Maximilian Michels wrote: > Hi, > > Robert and me were discussing on the subject of user-specified > environments for external transforms [1]. We couldn't decide whether > users should have direct control over the environment when they use an > external transform

Re: Beam dashboards

2019-05-23 Thread Łukasz Gajowy
Thanks for the hint Mikhail! I'll take a look at that. :) Best, Łukasz śr., 22 maj 2019 o 00:19 Mikhail Gryzykhin napisał(a): > @Łukasz Gajowy > > Reviving old thread: > Grafana doesn't support BigQuery officially, but recently there were news > with unofficial BQ plugin: > * >