Apologies if I’ve asked this question before but I didn’t see it in the list
and I’m certain that my last surviving brain cell has gone on strike over my
attempt to reduce my caffeine intake…
Posting this to both user and dev because I think the question / topic jumps in
to both camps.
Again since I’m a relative newbie on spark… I may be missing something so
apologies up front…
With respect to Spark SQL, in pre 2.0.x, there were only hash joins? In post
2.0.x you have hash, semi-hash , and sorted list merge.
For the sake of simplicity… lets forget about cross product joins…
Has anyone looked at how we could use inverted tables to improve query
The issue is that when you have a data sewer (lake) , what happens when your
use case query is orthogonal to how your data is stored? This means full table
By using secondary indexes, we can reduce this albeit at a cost of increasing
your storage footprint by the size of the index.
Are there any JIRAs open that discuss this?
Indexes to assist in terms of ‘predicate push downs’ (using the index when a
field in a where clause is indexed) rather than performing a full table scan.
Indexes to assist in the actual join if the join column is on an indexed column?
In the first, using an inverted table to produce a sort ordered set of row keys
that you would then use in the join process (same as if you produced the subset
based on the filter.)
To put this in perspective… here’s a dummy use case…
CCCis (CCC) is the middle man in the insurance industry. They have a piece of
software that sits in the repair shop (e.g Joe’s Auto Body) and works with
multiple insurance carriers.
The primary key in their data is going to be Insurance Company | Claim ID.
This makes it very easy to find a specific claim for further processing.
Now lets say I want to do some analysis on determining the average cost of
repairing a front end collision of a Volvo S80?
Break down the number and types of accidents by car manufacturer , model and
color. (Then see if there is any correlation between car color and # and type
As you can see, all of these queries are orthogonal to my storage. So I need
to create secondary indexes to help sift thru the data efficiently.
Does this make sense?
Please Note: I did some work for CCC back in the late 90’s. Any resemblance to
their big data efforts is purely coincidence and you can replace CCC with
Allstate, Progressive, StateFarm or some other auto insurance company …