Hello Bojidar,

Pulsar has a two-layer architecture in which serving/compute (the brokers) is separated from storage (the bookies). As a result, there is generally no particular advantage to colocating compute with storage. Colocation can only help if you run brokers and bookies together on the same nodes and pin topics so that they are served only from specific nodes. Few production environments use such a configuration, though, as it is not very robust.
Currently, functions are scheduled in a round-robin fashion.

Best,

Jerry

On Wed, Apr 24, 2019 at 6:04 AM Божидар Маринов <[email protected]> wrote:

> Greetings,
>
> We are considering using Pulsar in a project we are currently building.
> Specifically, we would like to use Pulsar Functions in order to process
> lots of sequential data.
>
> In our case, we are going to have persistent streams of all the data so
> far (so, unlimited size and time), and we want to run a function mapping
> one of them to a new stream.
>
> Due to the potentially large amounts of data, we would like to have the
> function running where the data is, as opposed to streaming most of the
> data between nodes.
>
> So far, we determined that Pulsar Functions would get the stream
> processing collocated with Pulsar, thus saving one of the roundtrips, and
> we would now like to know if the node it runs on would be selected in a way
> that would minimize the distance (as in latency) to the stored data.
>
> Additionally, we would like to know if there is a way to configure the
> function so that it will be relocated to different nodes, following the
> data. For example, if the first half of stream A is stored on node 1 and
> the second is stored on node 2, we would like a function with stream A as
> input to run on node 1 while processing the first half of the data and then
> be moved to node 2.
>
> Thanks in advance,
> Bojidar "bojidar-bg"
>
