AFAIK you can now express the partition key in the Table API, which will be used for co-location and optimization. So I'd give that a try first and convert the Table to a DataStream only where needed.
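A minimal sketch of that route, assuming Flink 1.13+. The table name, schema, topic, and connector options below are placeholders I made up for illustration, not anything from this thread:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.types.Row;

public class TableFirstSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // Hypothetical Kafka-backed table keyed by user id; topic, schema,
        // and options are placeholders.
        tEnv.executeSql(
                "CREATE TABLE user_events ("
                        + "  user_id STRING,"
                        + "  payload STRING"
                        + ") WITH ("
                        + "  'connector' = 'kafka',"
                        + "  'topic' = 'user-events',"
                        + "  'properties.bootstrap.servers' = 'localhost:9092',"
                        + "  'scan.startup.mode' = 'earliest-offset',"
                        + "  'format' = 'json'"
                        + ")");

        // Keep joins/aggregations in the Table API, where the planner can
        // reason about the key...
        Table perUser = tEnv.sqlQuery(
                "SELECT user_id, COUNT(*) AS events FROM user_events GROUP BY user_id");

        // ...and convert to a DataStream only where needed. The aggregate
        // produces an updating table, hence toChangelogStream.
        DataStream<Row> stream = tEnv.toChangelogStream(perUser);
        stream.print();
        env.execute("table-first-sketch");
    }
}
```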
On Sat, Jul 24, 2021 at 9:22 PM Dan Hill <quietgol...@gmail.com> wrote:

> Thanks Fabian and Senhong!
>
> Here's an example diagram of the join that I want to do. There are more
> layers of joins.
>
> https://docs.google.com/presentation/d/17vYTBUIgrdxuYyEYXrSHypFhwwS7NdbyhVgioYMxPWc/edit#slide=id.p
>
> 1) Thanks! I'll look into these.
>
> 2) I'm using the same key across multiple Kafka topics. I can change the
> producers and consumers to write to whatever partitions would help.
>
> The job is pretty simple right now. No optimizations. We're currently
> running this on one task manager. The goal of the email was to start
> thinking about optimizations. If the usual practice is to let Flink
> regroup the Kafka sources on input, how do teams deal with the serde
> overhead of this? Just factor it into overhead and allocate more
> resources?
>
> On Fri, Jul 23, 2021 at 3:21 AM Senhong Liu <senhong...@gmail.com> wrote:
>
>> Hi Dan,
>>
>> 1) If the key doesn't change in the downstream operators and you want
>> to avoid shuffling, maybe DataStreamUtils#reinterpretAsKeyedStream
>> would be helpful.
>>
>> 2) I'm not sure whether you mean that the data is already partitioned
>> in Kafka and you want to avoid the shuffle that keyBy() introduces in
>> Flink. One solution is to partition your data in Kafka the same way
>> Flink would partition it with keyBy(). After that, feel free to use
>> DataStreamUtils#reinterpretAsKeyedStream!
>>
>> If your use case is not what I described above, maybe you can provide
>> us more information.
>>
>> Best,
>> Senhong
>>
>> On Jul 22, 2021, 7:33 AM +0800, Dan Hill <quietgol...@gmail.com> wrote:
>>
>> Hi.
>>
>> 1) If I use the same key in downstream operators (my key is a user id),
>> will the rows stay on the same TaskManager machine? I join in more info
>> based on the user id as the key. I'd like for these to stay on the same
>> machine rather than shuffle a bunch of user-specific info to multiple
>> task manager machines.
>>
>> 2) What are best practices to reduce the number of shuffles when having
>> multiple Kafka topics with similar keys (user id)? E.g., should I make
>> sure the same key writes to the same partition number and then manually
>> assign which Flink tasks get which Kafka partitions?
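For reference, a minimal sketch of the DataStreamUtils#reinterpretAsKeyedStream approach Senhong mentions. The Tuple2 element type, the inline test data standing in for a pre-partitioned Kafka source, and the user-id key field are my own placeholders; note the API is marked experimental, and it only produces correct results if the input really is partitioned exactly the way keyBy() would partition it:

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamUtils;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ReinterpretAsKeyedSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Placeholder for a Kafka source whose partitions were written so
        // that each user id already lands on the subtask keyBy() would
        // route it to.
        DataStream<Tuple2<String, String>> events = env.fromElements(
                Tuple2.of("user-1", "impression"),
                Tuple2.of("user-2", "click"));

        // Reinterpret the (assumed) pre-partitioned stream as keyed,
        // skipping the keyBy() shuffle and its serde cost. If the data is
        // NOT partitioned like keyBy() would do it, results are incorrect.
        KeyedStream<Tuple2<String, String>, String> byUserId =
                DataStreamUtils.reinterpretAsKeyedStream(
                        events, event -> event.f0, Types.STRING);

        byUserId.print();
        env.execute("reinterpret-as-keyed-sketch");
    }
}
```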