Re: Does joining table in Spark multiplies selected columns of smaller table?

Jörn Franke Sun, 08 Apr 2018 10:59:44 -0700

What do you mean by "the value is very large" in T2? How large is it, and 
what kind of data is it? You could put the large data in separate files on 
HDFS and keep only the file name in the table.
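
A minimal sketch of that idea in plain Python (not Spark), assuming a local 
temp directory stands in for HDFS. The helpers store_blob and load_blob and 
the sample rows are hypothetical, made up for illustration only. The point is 
that the join result repeats a short file name rather than the heavy value.

```python
import tempfile
from pathlib import Path

# Assumption: a local temp directory stands in for an HDFS directory.
blob_dir = Path(tempfile.mkdtemp())


def store_blob(key, payload: bytes) -> str:
    """Write one large value to its own file and return the file name."""
    path = blob_dir / f"z2_{key}.bin"
    path.write_bytes(payload)
    return str(path)


def load_blob(file_name: str) -> bytes:
    """Read the large value back only when it is actually needed."""
    return Path(file_name).read_bytes()


# T2 keeps (z1 -> file name) instead of (z1 -> large z2 value).
large_values = [(1, b"A" * 1_000_000), (2, b"B" * 1_000_000)]
t2 = {z1: store_blob(z1, payload) for z1, payload in large_values}

# T1 has many rows per T2 key (many-to-one via x2 -> z1).
t1 = [("a", 1), ("b", 1), ("c", 2), ("d", 1)]

# Same shape as: select T1.x1, T2.z2 from T1 join T2 on T1.x2 = T2.z1,
# except that the second column now holds the file name, not the value.
joined = [(x1, t2[x2]) for x1, x2 in t1 if x2 in t2]
```

Each joined row carries only the file name, and load_blob(joined[0][1]) 
fetches the actual value on demand.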


> On 8. Apr 2018, at 19:52, Vitaliy Pisarev <vitaliy.pisa...@biocatch.com> 
> wrote:
> 
> I have two tables in spark:
> 
> T1
> |--x1
> |--x2
> 
> T2
> |--z1
> |--z2
> 
> T1 is much larger than T2.
> The values in column z2 are very large.
> There is a many-to-one relationship between T1 and T2 (via the x2 and z1 
> columns).
> 
> I perform the following query:
> 
> select T1.x1, T2.z2 from T1
> join T2 on T1.x2 = T2.z1
> 
> In the resulting data set, the same value from T2.z2 will be repeated for 
> many values of T1.x1.
> 
> Since this value is very heavy, I am concerned whether the data is actually 
> duplicated or whether there are internal optimisations that maintain only 
> references.
> 
> P.S. Originally posted on SO.
