The value is already stored in Azure Blob Storage, and the entities in T1 reference it. My problem is that in the computation I need to run, fetching the referenced value incurs a very large I/O penalty.
The reason is that the fetch is done once per record in T1, which may contain a million records. Fortunately, I have the referenced values stored in Parquet, so I figured I'd try a different access pattern.

On Sun, Apr 8, 2018, 20:58 Jörn Franke <jornfra...@gmail.com> wrote:

> What do you mean the value is very large in T2? How large? What is it? You
> could put the large data in separate files on HDFS and just maintain a
> file name in the table.
>
> On 8. Apr 2018, at 19:52, Vitaliy Pisarev <vitaliy.pisa...@biocatch.com>
> wrote:
>
> I have two tables in Spark:
>
> T1
> |--x1
> |--x2
>
> T2
> |--z1
> |--z2
>
> - T1 is much larger than T2
> - The values in column z2 are *very large*
> - There is a many-to-one relationship between T1 and T2 (via the x2 and
>   z1 columns).
>
> I perform the following query:
>
> select T1.x1, T2.z2 from T1
> join T2 on T1.x2 = T2.z1
>
> In the resulting data set, the same value from T2.z2 will be repeated
> alongside many values of T1.x1.
>
> Since this value is very heavy, I am concerned whether the data is
> actually duplicated or whether there are internal optimisations that
> maintain only references.
>
> p.s.
> Originally posted on SO <https://stackoverflow.com/q/49716385/180650>
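The access-pattern change described above (fetch per distinct referenced value instead of per T1 record) can be sketched in plain Python. This is only an illustration of the I/O-count argument, not Spark or Azure code; the names `fetch_value` and `records` are hypothetical stand-ins.

```python
def fetch_value(key, fetch_log):
    """Stand-in for an expensive remote read (blob store / Parquet scan).

    Records every fetch in fetch_log so we can count round-trips."""
    fetch_log.append(key)
    return f"payload-for-{key}"

def naive_pass(records, fetch_log):
    # One remote fetch per record: O(len(records)) round-trips.
    return [(rec_id, fetch_value(key, fetch_log)) for rec_id, key in records]

def batched_pass(records, fetch_log):
    # Fetch each distinct key once, then resolve in memory:
    # O(number of distinct keys) round-trips.
    cache = {key: fetch_value(key, fetch_log)
             for key in {k for _, k in records}}
    return [(rec_id, cache[key]) for rec_id, key in records]

# 9 records referencing only 3 distinct values (the many-to-one shape
# of the T1 -> T2 relationship in the quoted question).
records = [(i, f"blob-{i % 3}") for i in range(9)]

log_naive, log_batched = [], []
naive = naive_pass(records, log_naive)
batched = batched_pass(records, log_batched)

assert sorted(naive) == sorted(batched)   # same result set
print(len(log_naive), len(log_batched))   # 9 fetches vs 3
```

With a million T1 records referencing a much smaller set of T2 values, the second pattern cuts remote reads from one per record to one per distinct value.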
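The duplication concern in the quoted question can also be illustrated in plain Python: after an in-memory many-to-one join, every result row holds a *reference* to the same large z2 object rather than a copy. Whether Spark preserves that sharing internally depends on its row format and on serialization across shuffle boundaries, which this sketch deliberately does not model.

```python
# Tiny in-memory analogue of the many-to-one join from the question.
t1 = [("a", "k1"), ("b", "k1"), ("c", "k2")]          # rows of (x1, x2)
t2 = {"k1": "X" * 1_000_000, "k2": "Y" * 1_000_000}   # z1 -> very large z2

# select T1.x1, T2.z2 from T1 join T2 on T1.x2 = T2.z1
joined = [(x1, t2[x2]) for x1, x2 in t1]

# Both rows that matched "k1" point at the one-and-only payload object;
# no bytes were copied by the join itself.
assert joined[0][1] is joined[1][1]
print(len(joined))  # 3
```

The moment rows are serialized (written to disk, shuffled, or collected), each serialized row carries its own copy of the payload, which is exactly why keeping a small reference in the table and fetching the value separately is attractive.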