Re: [External] Re: DISCUSS: Sorted MapState API

2020-09-28 Thread Kenneth Knowles
On Wed, Sep 16, 2020 at 8:48 AM Tyson Hamilton wrote: > The use case is to support an unbounded stream-stream join, where the > elements are arriving in roughly time sorted order. Removing a specific > element from the timestamp indexed collection is necessary when a match is > found. > Just

Re: [External] Re: DISCUSS: Sorted MapState API

2020-09-16 Thread Tyson Hamilton
The use case is to support an unbounded stream-stream join, where the elements are arriving in roughly time sorted order. Removing a specific element from the timestamp indexed collection is necessary when a match is found. Having clearRange is helpful to expire elements that are no longer

Re: [External] Re: DISCUSS: Sorted MapState API

2020-09-15 Thread Reuven Lax
Hi, Currently we only support removing a timestamp range. You can remove a single timestamp of course by removing [ts, ts+1), however if there are multiple elements with the same timestamp this will remove all of those elements. Does this fit your use case? If not, I wonder if MapState is closer

Re: [External] Re: DISCUSS: Sorted MapState API

2020-09-15 Thread Tyson Hamilton
Hi Reuven, I noticed that there was an implementation of the in-memory OrderedListState introduced [1]. Where can I find out more regarding the plan and design? Is there a design doc? I'd like to know more details about the implementation to see if it fits my use case. I was hoping it would

Re: [External] Re: DISCUSS: Sorted MapState API

2020-08-03 Thread Catlyn Kong
Hey folks, Sry I'm late to this thread but this might be very helpful for the problem we're dealing with. Do we have a design doc or a jira ticket I can follow? Cheers, Catlyn On Thu, Jun 18, 2020 at 1:11 PM Jan Lukavský wrote: > My questions were just an example. I fully agree there is a

Re: DISCUSS: Sorted MapState API

2020-06-18 Thread Jan Lukavský
My questions were just an example. I fully agree there is a fundamental need for a sorted state (of some form, and I also think this links to efficient implementation of retrations) - I was reacting to Kenn's question about BIP. This one would be pretty nice example why it would be good to

Re: DISCUSS: Sorted MapState API

2020-06-18 Thread Reuven Lax
Jan - my proposal is exactly TimeSortedBagState (more accurately - TimeSortedListState), though I went a bit further and also proposed a way to have a dynamic number of tagged TimeSortedBagStates. You are correct that the runner doesn't really have to store the data time sorted - what's actually

Re: DISCUSS: Sorted MapState API

2020-06-18 Thread Jan Lukavský
Big +1 for a BIP, as this might really help clarify all the pros and cons of all possibilities. There seem to be questions that need answering and motivating use cases - do we need sorted map state or can we solve our use cases by something simpler - e.g. the mentioned TimeSortedBagState? Does

Re: DISCUSS: Sorted MapState API

2020-06-17 Thread Kenneth Knowles
Zooming in from generic philosophy to be clear: adding time ordered buffer to the Fn state API is *not* a shortcut.It has benefits that will not be achieved by SDK-side implementation on top of either ordered or unordered multimap. Are those benefits worth expanding the API? I don't know. A

Re: DISCUSS: Sorted MapState API

2020-06-17 Thread Reuven Lax
It might help for me to describe what I have in mind. I'm still proposing that we build multimap, just not a globally-sorted multimap. My previous proposal was that we provide a Multimap state type, sorted by key. this would have two additional operations - multimap.getRange(startKey, endKey) and

Re: DISCUSS: Sorted MapState API

2020-06-17 Thread Luke Cwik
The portability layer is meant to live across multiple versions of Beam and I don't think it should be treated by doing the simple and useful thing now since I believe it will lead to a proliferation of the API. On Wed, Jun 17, 2020 at 2:30 PM Kenneth Knowles wrote: > I have thoughts on the

Re: DISCUSS: Sorted MapState API

2020-06-17 Thread Kenneth Knowles
I have thoughts on the subject of whether to have APIs just for the lowest-level building blocks versus having APIs for higher-level constructs. Specifically this applies to providing only unsorted multimap vs what I will call "time-ordered buffer". TL;DR: I'd vote to focus on time-ordered buffer;

Re: DISCUSS: Sorted MapState API

2020-06-16 Thread Luke Cwik
I would be glad to take a stab at how to provide sorting on top of unsorted multimap state. Based upon your description, you want integer keys representing timestamps and arbitrary user value for the values, is that correct? What kinds of operations do you need on the sorted map state in order of

Re: DISCUSS: Sorted MapState API

2019-06-02 Thread Reuven Lax
This sounds to me like a potential runner strategy. However if a runner can natively support sorted maps (e.g. we expect the Dataflow runner to be able to do so, and I think it would be useful for other runners as well), then it's probably preferable to allow the runner to use its native

Re: DISCUSS: Sorted MapState API

2019-05-30 Thread Kenneth Knowles
On Tue, May 28, 2019 at 2:59 AM Robert Bradshaw wrote: > On Fri, May 24, 2019 at 6:57 PM Kenneth Knowles wrote: > > > > On Fri, May 24, 2019 at 9:51 AM Kenneth Knowles wrote: > >> > >> On Fri, May 24, 2019 at 8:14 AM Reuven Lax wrote: > >>> > >>> Some great comments! > >>> > >>> Aljoscha:

Re: DISCUSS: Sorted MapState API

2019-05-28 Thread Robert Bradshaw
On Fri, May 24, 2019 at 6:57 PM Kenneth Knowles wrote: > > On Fri, May 24, 2019 at 9:51 AM Kenneth Knowles wrote: >> >> On Fri, May 24, 2019 at 8:14 AM Reuven Lax wrote: >>> >>> Some great comments! >>> >>> Aljoscha: absolutely this would have to be implemented by runners to be >>> efficient.

Re: DISCUSS: Sorted MapState API

2019-05-24 Thread Lukasz Cwik
In my look for, it should have said "void". instead of "key". when explaining how to do it. On Fri, May 24, 2019 at 11:05 AM Lukasz Cwik wrote: > For the API that you proposed, the map key is always "void" and the sort > key == user key. So in my example of > key: dummy value > key.000: token,

Re: DISCUSS: Sorted MapState API

2019-05-24 Thread Lukasz Cwik
For the API that you proposed, the map key is always "void" and the sort key == user key. So in my example of key: dummy value key.000: token, (0001, value4) key.001: token, (0010, value1), (0011, value2) key.01: token key.1: token, (1011, value3) you would have: "void": dummy value "void".000:

Re: DISCUSS: Sorted MapState API

2019-05-24 Thread Reuven Lax
Can you explain how fetching and deleting ranges of keys would work with this data structure? On Fri, May 24, 2019 at 9:50 AM Lukasz Cwik wrote: > Reuven, for the example, I assume that we never want to store more then 2 > values at a given sort key prefix, and if we do then we will create a

Re: DISCUSS: Sorted MapState API

2019-05-24 Thread Kenneth Knowles
On Fri, May 24, 2019 at 9:51 AM Kenneth Knowles wrote: > > > On Fri, May 24, 2019 at 8:14 AM Reuven Lax wrote: > >> Some great comments! >> >> *Aljoscha*: absolutely this would have to be implemented by runners to >> be efficient. We can of course provide a default (inefficient) >>

Re: DISCUSS: Sorted MapState API

2019-05-24 Thread Kenneth Knowles
On Fri, May 24, 2019 at 8:14 AM Reuven Lax wrote: > Some great comments! > > *Aljoscha*: absolutely this would have to be implemented by runners to be > efficient. We can of course provide a default (inefficient) implementation, > but ideally runners would provide better ones. > > *Jan* Exactly.

Re: DISCUSS: Sorted MapState API

2019-05-24 Thread Reuven Lax
On Fri, May 24, 2019 at 9:36 AM Rui Wang wrote: > >> *Jan* Exactly. I think MapState can be dropped or backed by this. E.g. >> >>> >>> Regarding to performance, I have a concern to drop MapSate or use > SortedMapState to back it. Although SortedMapState provide API to remove a > single key, I

Re: DISCUSS: Sorted MapState API

2019-05-24 Thread Lukasz Cwik
Reuven, for the example, I assume that we never want to store more then 2 values at a given sort key prefix, and if we do then we will create a new longer prefix splitting up the values based upon the sort key. Tuple representation in examples below is (key, sort key, value) and . is a character

Re: DISCUSS: Sorted MapState API

2019-05-24 Thread Rui Wang
> > > *Jan* Exactly. I think MapState can be dropped or backed by this. E.g. > >> >> Regarding to performance, I have a concern to drop MapSate or use SortedMapState to back it. Although SortedMapState provide API to remove a single key, I would imagine its implementation in runners will different

Re: DISCUSS: Sorted MapState API

2019-05-24 Thread Reuven Lax
Some great comments! *Aljoscha*: absolutely this would have to be implemented by runners to be efficient. We can of course provide a default (inefficient) implementation, but ideally runners would provide better ones. *Jan* Exactly. I think MapState can be dropped or backed by this. E.g.

Re: DISCUSS: Sorted MapState API

2019-05-24 Thread Robert Bradshaw
On Fri, May 24, 2019 at 5:32 AM Reuven Lax wrote: > > On Thu, May 23, 2019 at 1:53 PM Ahmet Altay wrote: >> >> >> >> On Thu, May 23, 2019 at 1:38 PM Lukasz Cwik wrote: >>> >>> >>> >>> On Thu, May 23, 2019 at 11:37 AM Rui Wang wrote: > > A few obvious problems with this code: > 1.

Re: DISCUSS: Sorted MapState API

2019-05-24 Thread Jan Lukavský
. 5. 2019 10:56:45 Předmět: Re: DISCUSS: Sorted MapState API " This is quite interesting! The Flink Table API (relational and SQL) has an implementation for the type of join you mention in the example. We call it Temporal Table Join, and it works on something we call Temporal Tables: 

Re: DISCUSS: Sorted MapState API

2019-05-24 Thread Aljoscha Krettek
This is quite interesting! The Flink Table API (relational and SQL) has an implementation for the type of join you mention in the example. We call it Temporal Table Join, and it works on something we call Temporal Tables:

Re: DISCUSS: Sorted MapState API

2019-05-23 Thread Reuven Lax
On Thu, May 23, 2019 at 1:53 PM Ahmet Altay wrote: > > > On Thu, May 23, 2019 at 1:38 PM Lukasz Cwik wrote: > >> >> >> On Thu, May 23, 2019 at 11:37 AM Rui Wang wrote: >> >>> A few obvious problems with this code: 1. Removing the elements already processed from the bag requires

Re: DISCUSS: Sorted MapState API

2019-05-23 Thread Reuven Lax
On Thu, May 23, 2019 at 11:23 AM Lukasz Cwik wrote: > I would suggest that we drop MapState and instead support MultimapState > without ordering as a first pass and potentially add ordering later. > While I agree that MultimapState is useful on its own, for these aggregation use cases ordering

Re: DISCUSS: Sorted MapState API

2019-05-23 Thread Reza Rokni
+100 As someone who has just spent a lot of time coding all the "GC, caching of BagState fetches, etc" this would make life a lot easier! Its also generally valuable for a lot of timeseries work. On Fri, 24 May 2019 at 04:59, Lukasz Cwik wrote: > > > On Thu, May 23, 2019 at 1:53 PM Ahmet Altay

Re: DISCUSS: Sorted MapState API

2019-05-23 Thread Lukasz Cwik
On Thu, May 23, 2019 at 1:53 PM Ahmet Altay wrote: > > > On Thu, May 23, 2019 at 1:38 PM Lukasz Cwik wrote: > >> >> >> On Thu, May 23, 2019 at 11:37 AM Rui Wang wrote: >> >>> A few obvious problems with this code: 1. Removing the elements already processed from the bag requires

Re: DISCUSS: Sorted MapState API

2019-05-23 Thread Ahmet Altay
On Thu, May 23, 2019 at 1:38 PM Lukasz Cwik wrote: > > > On Thu, May 23, 2019 at 11:37 AM Rui Wang wrote: > >> A few obvious problems with this code: >>> 1. Removing the elements already processed from the bag requires >>> clearing and rewriting the entire bag. This is O(n^2) in the number of

Re: DISCUSS: Sorted MapState API

2019-05-23 Thread Lukasz Cwik
On Thu, May 23, 2019 at 11:37 AM Rui Wang wrote: > A few obvious problems with this code: >> 1. Removing the elements already processed from the bag requires >> clearing and rewriting the entire bag. This is O(n^2) in the number of >> input trades. >> > why it's not O(2 * n) to clearing and

Re: DISCUSS: Sorted MapState API

2019-05-23 Thread Rui Wang
> > A few obvious problems with this code: > 1. Removing the elements already processed from the bag requires > clearing and rewriting the entire bag. This is O(n^2) in the number of > input trades. > why it's not O(2 * n) to clearing and rewriting trade state? > public interface

Re: DISCUSS: Sorted MapState API

2019-05-23 Thread Lukasz Cwik
I would suggest that we drop MapState and instead support MultimapState without ordering as a first pass and potentially add ordering later. Inside an SDK we would be able to build a "sorted" multimap state that uses a prefix of the key and implement radix based sorting. Any time a prefix has too