Hi,
I'm looking to implement an iterative algorithm with data coming in from 
Cassandra. This might be a use case for GraphX; however, the ids are 
non-integral, and I would like to avoid a mapping (for now). I'm doing a simple 
hubs-and-authorities HITS implementation, and it currently does a lot of db 
access. It's fine (one half of a full iteration completes in 25 minutes on 3M+ 
vertices), and use of Spark's cache() got it there. However, each full 
iteration takes 50 minutes, and I would like to improve on that.
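
For context, the load-and-cache part looks roughly like this (a sketch; the 
keyspace/table names, connection host and field-to-column mapping are 
placeholders, and I'm assuming the DataStax spark-cassandra-connector):

import com.datastax.spark.connector._              // DataStax spark-cassandra-connector
import org.apache.spark.{SparkConf, SparkContext}

// Mirrors the vertex structure described below; adjust names/types to the real schema.
case class Vertex(id: String, in: Set[String], out: Set[String],
                  aScore: Double, hScore: Double)

object LoadVertices {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("hits")
      .set("spark.cassandra.connection.host", "127.0.0.1")   // placeholder host

    val sc = new SparkContext(conf)

    // Read every vertex row once and keep it in executor memory for the whole job.
    val vertices = sc.cassandraTable[Vertex]("my_keyspace", "vertices").cache()

    println(s"loaded ${vertices.count()} vertices")
  }
}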

A high-level overview of what I'm trying to do:

1) Vertex structure (id, in, out, aScore, hScore).
2) Load all the vertices into memory (simple enough).
3) Have a lookup vertexid -> (aScore, hScore) in memory (currently, this is 
where I need to do a lot of Cassandra queries... they are very fast, but I'm 
hoping to avoid them).
4) Iterate n times in 2 stages (a rough sketch of the hub stage follows the list):
    In the Hub Stage:
    a) For each vertex, get the sum of aScores for the vertices it points to. 
Cache this.
    b) From the cache, get the max score. Divide each score in the cache by the 
max.
    c) Get rid of the cache.
    d) Update the lookup (in (3)) with the new hScores.
   
    In the Authority Stage:
    a) For each vertex, get the sum of hScores for the vertices that point to 
it. Cache this.
    b) From the cache, get the max score. Divide each score in the cache by the 
max.
    c) Get rid of the cache.
    d) Update the lookup (in (3)) with the new aScores.
 
5) Update the final aScores and hScores from memory to Cassandra.
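
To make step 4 concrete, here is roughly how I'm picturing the hub stage in 
Scala/Spark if the lookup were an RDD I could join against. The names (edges, 
scores, hubStage) and shapes are placeholders of mine, derived from the vertex 
structure in (1), so treat it as a sketch rather than working code:

import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel
// On Spark versions before 1.3 the pair-RDD operations below may also need:
// import org.apache.spark.SparkContext._

// Assumed shapes:
//   edges:  (srcId, dstId)           -- the static structure, cached once up front
//   scores: (id, (aScore, hScore))   -- the in-memory lookup from (3), assumed cached
def hubStage(edges: RDD[(String, String)],
             scores: RDD[(String, (Double, Double))]): RDD[(String, (Double, Double))] = {

  val aScores = scores.mapValues { case (a, _) => a }

  // (a) For each vertex, sum the aScores of the vertices it points to:
  //     key the edges by destination so the join picks up the target's aScore,
  //     then re-key by source and sum.
  val rawH = edges
    .map { case (src, dst) => (dst, src) }
    .join(aScores)                        // (dst, (src, aScore of dst))
    .map { case (_, (src, a)) => (src, a) }
    .reduceByKey(_ + _)
    .persist(StorageLevel.MEMORY_ONLY)    // "cache this"

  // (b) Normalise by the maximum hub score (max() is an action, so rawH materialises here).
  val maxH = rawH.values.max()
  val newH = rawH.mapValues(_ / maxH)

  // (d) Fold the new hScores back into the lookup. Vertices with no outgoing
  //     edges end up with hScore 0.0, which is what HITS gives them anyway.
  val updated = scores.leftOuterJoin(newH).mapValues {
    case ((a, _), hOpt) => (a, hOpt.getOrElse(0.0))
  }.persist(StorageLevel.MEMORY_ONLY)

  updated.count()      // force the new lookup into memory...
  rawH.unpersist()     // (c) ...then get rid of the intermediate cache
  updated
}

The authority stage would be the mirror image (join the hScores via edges keyed 
by source, re-key by destination, sum), and after each stage the previous 
scores RDD would be unpersisted once the new one is in place, so Cassandra is 
only touched when loading in (2) and when writing back in (5).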

The one bit that I don't have now is the in-memory lookup (i.e. to get the 
hScores and aScores of neighbours in (4-a)). As such, I have to query Cassandra 
x times for each vertex, where x is the number of neighbours. And as those 
values are used in the next iteration, I also have to update Cassandra on each 
run. Is it possible to have this as an in-memory distributed lookup so that I 
only need to deal with the data store at the start and the end?
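
Something like the following is what I'm imagining as an alternative, under the 
assumption (mine) that the whole id -> (aScore, hScore) map fits comfortably on 
the driver and each executor -- for 3M+ string keys that may be a few hundred 
MB. The names are placeholders again:

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Same Vertex case class as in the load sketch above.
case class Vertex(id: String, in: Set[String], out: Set[String],
                  aScore: Double, hScore: Double)

// Broadcast-based hub stage: collect the scores once per half-iteration and
// broadcast them, so the neighbour lookups in (4-a) become local map reads
// instead of Cassandra queries.
def hubStageBroadcast(sc: SparkContext,
                      vertices: RDD[Vertex],
                      scores: RDD[(String, (Double, Double))]): RDD[(String, Double)] = {

  // One collect + broadcast per half-iteration replaces x Cassandra reads per vertex.
  val bScores = sc.broadcast(scores.collectAsMap())

  // (a) Sum the aScores of the vertices each vertex points to -- a local hash-map read.
  val rawH = vertices.map { v =>
    val sum = v.out.iterator
      .map(id => bScores.value.get(id).map(_._1).getOrElse(0.0))  // aScore of each target
      .sum
    (v.id, sum)
  }.cache()   // "cache this", so the max() action and the normalisation share one pass

  // (b) Normalise by the max hub score, as before.
  val maxH = rawH.map(_._2).max()
  rawH.map { case (id, h) => (id, h / maxH) }
}

Each stage would re-collect and re-broadcast the updated map and drop the old 
broadcast and cached RDDs afterwards, but I'm not sure whether the join-based 
or the broadcast-based route is the saner one here.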

One option is to identify clusters and run HITS for each cluster entirely in 
memory; however, if there's a simpler way, I'd prefer that.

Regards,
Ashic.
                                          
