What do you mean by "the values in z2 are very large"? How large is each 
value, and what kind of data is it? You could put the large data in separate 
files on HDFS and keep only the file name in the table.
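
To illustrate the idea, here is a minimal sketch in plain Python. It is not Spark code: a local temp directory stands in for HDFS, a dict and a list stand in for T2 and T1, and the names store_blob, load_blob and blob_dir are made up for this example. The point is only that the join result then repeats a short file name instead of the heavy value.

```python
import os
import tempfile

# Assumption: a local temp directory stands in for an HDFS directory.
blob_dir = tempfile.mkdtemp()

def store_blob(key, payload):
    """Write one large value to its own file and return the file name."""
    path = os.path.join(blob_dir, f"{key}.bin")
    with open(path, "wb") as f:
        f.write(payload)
    return path

def load_blob(path):
    """Read a large value back, only when it is actually needed."""
    with open(path, "rb") as f:
        return f.read()

# T2 holds (z1 -> file name of z2) instead of (z1 -> z2 itself).
t2 = {
    "a": store_blob("a", b"x" * 1_000_000),
    "b": store_blob("b", b"y" * 1_000_000),
}

# T1 rows are (x1, x2); many rows refer to the same z1.
t1 = [(1, "a"), (2, "a"), (3, "b"), (4, "a")]

# Rough equivalent of:
#   select T1.x1, T2.z2_path from T1 join T2 on T1.x2 = T2.z1
joined = [(x1, t2[x2]) for (x1, x2) in t1 if x2 in t2]

# Each joined row carries only a short path; the payload is fetched on demand.
first_payload = load_blob(joined[0][1])
```

With this layout the many-to-one join duplicates only the file name per row, and the large value is read once per distinct file when you need it.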

> On 8. Apr 2018, at 19:52, Vitaliy Pisarev <vitaliy.pisa...@biocatch.com> 
> wrote:
> 
> I have two tables in spark:
> 
> T1
> |--x1
> |--x2
> 
> T2
> |--z1
> |--z2
> 
> T1 is much larger than T2.
> The values in column z2 are very large.
> There is a many-to-one relationship between T1 and T2 (via the x2 
> and z1 columns).
> I perform the following query:
> 
> select T1.x1, T2.z2 from T1
> join T2 on T1.x2 = T2.z1
> 
> In the resulting data set, the same value from T2.z2 will be repeated for 
> many values of T1.x1.
> 
> Since this value is very heavy, I am concerned: is the data actually 
> duplicated, or are there internal optimisations that maintain only 
> references?
> 
> P.S. Originally posted on SO
