Can you split the files beforehand into several files (e.g. by the column you 
do the join on)?
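
Something like the sketch below is what I have in mind: bucket the master 
file by the join key and sort within each bucket, which is more or less the 
bucketed-table support SPARK-12394 added in Spark 2.0.  This is untested, and 
the paths, table name, key column and bucket count are all placeholders:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("bucketed-join-sketch")
  .getOrCreate()

// One-time job: bucket + sort the master data by the join key and persist
// it as a table, so the per-bucket sort order is recorded in the catalog.
val master = spark.read.parquet("/path/to/master")   // placeholder path
master.write
  .bucketBy(200, "key")   // bucket count is a tuning knob; "key" = join column
  .sortBy("key")
  .saveAsTable("master_bucketed")

// Later jobs: join the (smaller) transaction data against the bucketed
// table.  Catalyst sees the bucketing/sort metadata, so it should only need
// to shuffle/sort the transaction side, not the ~700m-row master.
val transactions = spark.read.parquet("/path/to/transactions")
val joined = spark.table("master_bucketed").join(transactions, "key")
joined.write.parquet("/path/to/output")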

> On 10 Nov 2016, at 23:45, Stuart White <stuart.whi...@gmail.com> wrote:
> 
> I have a large "master" file (~700m records) that I frequently join smaller 
> "transaction" files to.  (The transaction files have 10's of millions of 
> records, so too large for a broadcast join).
> 
> I would like to pre-sort the master file, write it to disk, and then, in 
> subsequent jobs, read the file off disk and join to it without having to 
> re-sort it.  I'm using Spark SQL, and my understanding is that the Spark 
> Catalyst Optimizer will choose an optimal join algorithm if it is aware that 
> the datasets are sorted.  So, the trick is to make the optimizer aware that 
> the master file is already sorted.
> 
> I think SPARK-12394 provides this functionality, but I can't seem to put the 
> pieces together for how to use it. 
> 
> Could someone possibly provide a simple example of how to:
> 
> 1. Sort a master file by a key column and write it to disk in such a way 
>    that its "sorted-ness" is preserved.
> 2. In a later job, read a transaction file, sort/partition it as necessary.
> 3. Read the master file, preserving its sorted-ness.
> 4. Join the two DataFrames in such a way that the master rows are not 
>    sorted again.
> 
> Thanks!
> 
