Greetings,

We are considering using Pulsar in a project we are currently building. Specifically, we would like to use Pulsar Functions in order to process lots of sequential data.
In our case, we are going to have persistent streams of all the data so far (so, unlimited in size and time), and we want to run a function mapping one of them to a new stream. Due to the potentially large amounts of data, we would like to have the function running where the data is, as opposed to streaming most of the data between nodes.

So far, we have determined that Pulsar Functions would get the stream processing colocated with Pulsar, thus saving one of the round trips. We would now like to know whether the node a function runs on is selected in a way that minimizes the distance (as in latency) to the stored data.

Additionally, we would like to know if there is a way to configure the function so that it is relocated to different nodes, following the data. For example, if the first half of stream A is stored on node 1 and the second half is stored on node 2, we would like a function with stream A as input to run on node 1 while processing the first half of the data and then be moved to node 2.

Thanks in advance,
Bojidar "bojidar-bg"
