Hi everyone,

I have the following scenario: I have a database table with 3 columns: a
host (string), a timestamp, and an integer ID. Conceptually, what I'd like
to do is:

group by host and timestamp -> based on all the IDs in each group, create a
mapping to n new tuples -> for each unique tuple, count how many times it
appeared across the resulting data

Each new tuple has 3 fields: the host, a new ID, and an Integer=1

What I'm currently doing is roughly:

val input = JDBCInputFormat.buildJDBCInputFormat()...finish()
val source = environment.createInput(inut)
source.partitionByHash("host", "timestamp").mapPartition(...).groupBy(0,
1).aggregate(SUM, 2)

The query given to JDBCInputFormat provides results ordered by host and
timestamp, and I was wondering if performance can be improved by specifying
this in the code. I've looked at
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Terminology-Split-Group-and-Partition-td11030.html
and
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Fwd-Processing-Sorted-Input-Datasets-td20038.html,
but I still have some questions:

- If a split is a subset of a partition, what is the meaning of
SplitDataProperties#splitsPartitionedBy? The wording makes me thing that a
split is divided into partitions, meaning that a partition would be a
subset of a split.
- At which point can I retrieve and adjust a SplitDataProperties instance,
if possible at all?
- If I wanted a coarser parallelization where each slot gets all the data
for the same host, would I have to manually create the sub-groups based on
timestamp?

Regards,
Alexis.

Reply via email to