Re: Cross operation on two huge datasets

Fabian Hueske Wed, 22 Feb 2017 14:42:14 -0800

Hi Gwen,

Flink usually performs a block nested loop join to cross two data sets.
This algorithm spills one input to disk and streams the other input. For
each input it fills a memory buffer and to perform the cross. Then the
buffer of the spilled input is refilled with spilled records and records
are again crossed. This is done until one iteration over the spill records
is done. Then the other buffer of the streamed input is filled with the
next records.


You should be aware that cross is a super expensive operation, especially
if you evaluate a complex condition for each pair of records. So cross can
be easily too expensive to compute.
For such use cases it is usually better to apply a coarse-grained spatial
partitioning and do a key-based join on the partitions. Within each
partition you'd perform a cross.

Best, Fabian


2017-02-21 18:34 GMT+01:00 Gwenhael Pasquiers <
gwenhael.pasqui...@ericsson.com>:

> Hi,
>
>
>
> I need (or at least I think I do) to do a cross operation between two huge
> datasets. One dataset is a list of points. The other one is a list of
> shapes (areas).
>
>
>
> I want to know, for each point, the areas (they might overlap so a point
> can be in multiple areas) it belongs to so I thought I’d “cross” my points
> and areas since I need to test each point against each area.
>
>
>
> I tried it and my job stucks seems to work for some seconds then, at some
> point, it stucks.
>
>
>
> I’m wondering if Flink, for cross operations, tries to load one of the two
> datasets into RAM or if it’s able to split the job in multiple iterations
> (even if it means reading one of the two datasets multiple times).
>
>
>
> Or maybe I’m going at it the wrong way, or missing some parameters, feel
> free to correct me J
>
>
>
> I’m using flink 1.0.1.
>
>
>
> Thanks in advance
>
>
>
> Gwen’
>

Re: Cross operation on two huge datasets

Reply via email to