Hi, I'm looking to do an iterative algorithm implementation with data coming in from Cassandra. This might be a use case for GraphX, however the ids are non-integral, and I would like to avoid a mapping (for now). I'm doing a simple hubs and authorities HITS implementation, and the current implementation does a lot of db access. It's fine (one half of a full iteration is done in 25 minutes on 3M+ vertices), and use of Spark's cache() has achieved that. However, each full iteration is 50 minutes, and I would like to improve that.
A high level overview of what I'm trying to do is: 1) Vertex structure (id, in, out, aScore, hScore). 2) Load all the vertices into memory (simple enough). 3) Have a lookup vertexid -> (aScore, hScore) in memory (currently, this is where I need to do a lot of cassandra queries...which are very fast, but hoping to avoid). 4) Iterate n times in 2 statges: In the Hub Stage: a) Foreach vertex, get the sum of aScores for vertices it points to. Cache this. b) From the cache, get the max score. Divide each score in the cache by the max. c) Get rid of the cache. d) Update the lookup (in (3)) with the new hScores. In the Authority Stage: a) Foreach vertex, get the sum of hScores for vertices that point to it. Cache this. b) From the cache, get the max score. Divide each score in the cache by the max. c) Get rid of the cache. d) Update the lookup (in (3)) with the new aScores. 5) Update the final aScores and hScores from memory to Cassandra. The one bit that I don't have now is the in memory lookup (i.e. to get the hScores and aScores of neighbours in (4-a ). As such, I have to query cassandra for each vertex x times where x is the number of neighbours. And as those values are used in the next iteration, I also have to update cassandra for each run. Is it possibly to have this as an in memory distributed lookup so that I can deal with the data store at the start and end? One option is to identify clusters and run HITS for each cluster entirely in memory, however if there's a simpler way I'd prefer that. Regards, Ashic.