Does Top-N query have to be result-updating, even for batch input?

Yik San Chan Sat, 08 May 2021 02:22:43 -0700

Hi community,

Top-N query is result-updating, according to [docs](
https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/dev/table/sql/queries/topn/
):


> The TopN query is *Result Updating*. Flink SQL will sort the input data
stream according to the order key, so if the top N records have been
changed, the changed ones will be sent as retraction/update records to
downstream.

It makes sense for streaming input. However, I feel it makes less sense for
batch input. Why? Because the intermediate result of Top-N is not useful
for a batch input.

For example [here](
https://github.com/YikSanChan/pyflink-quickstart/blob/12b89341301c62072c4dfff4eecc39edad61ef9b/topk.py),
I get a lot of intermediate output.

For example, take a look at this input:

```

1,100,1
1,101,3
2,201,4
2,200,2
2,202,5
2,203,6
1,102,7
1,103,8
1,104,9
2,204,10

```

The first column being user_id, the second being movie_id, the third being
score (generated by some ML model). I want to tell what the top 3 movies
that each user_id should watch.

Since the input is batch and finite, what I need is really just:

1 -> 102,103,104
2 -> 202,203,204

However, using this [PyFlink program](
https://github.com/YikSanChan/pyflink-quickstart/blob/12b89341301c62072c4dfff4eecc39edad61ef9b/topk.py),
I get a ton of intermediate output:

```
+I(1,100)
-U(1,100)
+U(1,100,101)
+I(2,201)
-U(2,201)
+U(2,201,200)
-U(2,201,200)
+U(2,201,200,202)
-U(2,201,200,202)
+U(2,201,202)
-U(2,201,202)
+U(2,201,202,203)
-U(1,100,101)
+U(1,100,101,102)
-U(1,100,101,102)
+U(1,101,102)
-U(1,101,102)
+U(1,101,102,103)
-U(1,101,102,103)
+U(1,102,103)
-U(1,102,103)
+U(1,102,103,104)
-U(2,201,202,203)
+U(2,202,203)
-U(2,202,203)
+U(2,202,203,204)
```

And since it is result-updating, I am not able to use a file as the sink.

My question is: is there a way to generate top-n movies for each user, and
make sure the output is not result-updating?

Thank you!

Best,
Yik San

Does Top-N query have to be result-updating, even for batch input?

Reply via email to