Re: Side Inputs size

Augusto Ribeiro Wed, 10 Apr 2019 02:50:04 -0700

Thanks for the input.

Storing the maps in a static variable improved performance a lot. Of course, 
if these "sideInputs" grow too large I might need to switch to the 
CoGroupByKey approach instead.
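
For anyone finding this thread later, here is a minimal sketch of the 
static-variable workaround, assuming Java. It is not Beam API: the names 
WordCountCache, get and loads are hypothetical, and the loader stands in for 
whatever reads the side input. The idea is only that the map is materialised 
once per JVM (worker process) instead of once per element.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical helper (not part of Beam): caches a map in a static field so
// that it is built at most once per JVM, however many elements are processed.
final class WordCountCache {
    // Shared by every instance and thread in this JVM; volatile so that a
    // fully built map is visible to other threads once it is assigned.
    private static volatile Map<String, Integer> counts;

    // Counts how many times the expensive load really ran (illustration only).
    static final AtomicInteger loads = new AtomicInteger();

    private WordCountCache() {}

    // Returns the cached map, invoking the loader only on the first call.
    static Map<String, Integer> get(Supplier<Map<String, Integer>> loader) {
        Map<String, Integer> local = counts;
        if (local == null) {
            synchronized (WordCountCache.class) {
                local = counts; // re-check under the lock
                if (local == null) {
                    loads.incrementAndGet();
                    // Defensive, read-only copy of whatever the loader returns.
                    local = Collections.unmodifiableMap(new HashMap<>(loader.get()));
                    counts = local;
                }
            }
        }
        return local;
    }
}
```

Inside a DoFn's processElement this might be called as 
WordCountCache.get(() -> c.sideInput(wordCountsView)), where wordCountsView is 
an assumed PCollectionView<Map<String, Integer>>. The cached copy is never 
refreshed, so the sketch only suits a side input that stays the same for the 
life of the worker.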


Thanks again,
Augusto

> On 8 Apr 2019, at 20:07, Lukasz Cwik <[email protected]> wrote:
> 
> Side input performance and scaling is runner dependent. Runners should 
> attempt to provide support for efficient random access lookup in the maps.
> Side inputs should also be cached across elements if the map hasn't changed, 
> which runners should also be capable of doing.
> 
> So yes, side input size can impact performance depending on which runner you 
> choose to use. Some runners don't deal with side inputs at all while others 
> may scale to support terabytes in size.
> 
> Saving it as a static class variable may be a useful workaround if the runner 
> is not performing as well as you would like.
> 
> Map side inputs are usually used to produce joins. Have you tried using 
> CoGroupByKey to do the join instead?
> 
> On Mon, Apr 8, 2019 at 10:30 AM [email protected] wrote:
> Hi,
> 
> In one of my transforms I am using a Map, which is the result of a previous 
> transform, as a sideInput. This Map<String, Int> is potentially very large, 
> holding the count of every word that appeared in all documents.
> 
> The step that uses the sideInput is quite slow because it seems like it is 
> initialising a huge HashMap for every element it processes (I followed this 
> example: https://beam.apache.org/documentation/programming-guide/#side-inputs).
> 
> Is this the wrong way of using sideInputs? And by this I mean, can a 
> sideInput be too big to be a sideInput? I also thought about saving the 
> sideInput as a static class variable, then in principle I only have to read 
> it once per "transform" initialised in the cluster.
> 
> Am I going about this totally wrong? Should I try other approaches?
> 
> Best regards,
> Augusto
> 
>
