Can you split the file beforehand into several files (e.g., by the column you join on)?
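That is essentially what Spark SQL's bucketing support does for you (this is what SPARK-12394 added, available since Spark 2.0). A rough, untested sketch below; the paths, table name, join column ("key"), and bucket count are all placeholders you would adapt:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("bucketed-join").getOrCreate()

    // One-time job: hash-bucket and sort the master file by the join key.
    // Note the bucketing/sorting metadata only survives via saveAsTable
    // (a catalog/metastore table), not via a plain .save() to a path.
    spark.read.parquet("/data/master")       // hypothetical path
      .write
      .bucketBy(200, "key")                  // 200 buckets on the join column (tune this)
      .sortBy("key")                         // sort rows within each bucket
      .saveAsTable("master_bucketed")        // hypothetical table name

    // Later jobs: join a transaction file against the pre-bucketed master.
    val transactions = spark.read.parquet("/data/transactions")  // hypothetical path
    val master = spark.table("master_bucketed")

    val joined = transactions.join(master, "key")
    joined.explain()   // verify: no Exchange (shuffle) on the bucketed side

Catalyst reads the bucket/sort metadata from the catalog, so the sort-merge join should not re-shuffle the master side. Whether the per-bucket sort is also skipped depends on your Spark version and on each bucket being written as a single file, so check the physical plan with explain().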
> On 10 Nov 2016, at 23:45, Stuart White <stuart.whi...@gmail.com> wrote:
>
> I have a large "master" file (~700m records) that I frequently join smaller
> "transaction" files to. (The transaction files have tens of millions of
> records, so they are too large for a broadcast join.)
>
> I would like to pre-sort the master file, write it to disk, and then, in
> subsequent jobs, read the file off disk and join to it without having to
> re-sort it. I'm using Spark SQL, and my understanding is that the Spark
> Catalyst optimizer will choose an optimal join algorithm if it is aware
> that the datasets are sorted. So, the trick is to make the optimizer aware
> that the master file is already sorted.
>
> I think SPARK-12394 provides this functionality, but I can't seem to put
> the pieces together for how to use it.
>
> Could someone possibly provide a simple example of how to:
> 1. Sort a master file by a key column and write it to disk in such a way
>    that its "sorted-ness" is preserved.
> 2. In a later job, read a transaction file and sort/partition it as
>    necessary.
> 3. Read the master file, preserving its sorted-ness.
> 4. Join the two DataFrames in such a way that the master rows are not
>    sorted again.
>
> Thanks!