Hi Joe,

A while ago I was running Titan on top of HBase to store graph data. I then used Spark (via TitanHBaseInputFormat; you could use the Cassandra version) to load the vertices as an RDD[Vertex], on which I ran analytics and machine learning. That could form the basis for getting the data into a form usable in GraphX.
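
As a rough sketch of that last step (this is not the code from the talk): assuming the Titan vertices have already been mapped into a plain serializable record, something like the following would build a GraphX graph from them. `VertexRecord`, `TitanToGraphX`, and the sample data are hypothetical names made up for illustration; only `Graph`, `Edge`, and `VertexId` come from GraphX, and the step that reads from Titan itself is left out.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// Hypothetical flattened form of one Titan vertex: its id, a single
// property, and its outgoing edges as (targetId, edgeLabel) pairs.
case class VertexRecord(id: Long, name: String, outEdges: Seq[(Long, String)])

object TitanToGraphX {

  // Split the per-vertex records into the vertex RDD and edge RDD
  // that GraphX expects, then assemble the property graph.
  def toGraph(records: RDD[VertexRecord]): Graph[String, String] = {
    val vertices: RDD[(VertexId, String)] =
      records.map(r => (r.id, r.name))

    val edges: RDD[Edge[String]] =
      records.flatMap { r =>
        r.outEdges.map { case (dst, label) => Edge(r.id, dst, label) }
      }

    // Vertices that appear only as an edge endpoint get the default attribute.
    Graph(vertices, edges, "unknown")
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("titan-to-graphx").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Stand-in for the RDD that would come out of the Titan input format.
      val records = sc.parallelize(Seq(
        VertexRecord(1L, "alice", Seq((2L, "knows"), (3L, "knows"))),
        VertexRecord(2L, "bob", Seq.empty),
        VertexRecord(3L, "carol", Seq.empty)
      ))

      val graph = toGraph(records)
      println(s"vertices=${graph.vertices.count()} edges=${graph.edges.count()}")
    } finally {
      sc.stop()
    }
  }
}
```

Once the data is in a `Graph`, the usual GraphX operators (`pageRank`, `connectedComponents`, `subgraph`, and so on) are available on it. How much of each vertex's properties to carry over is a judgment call; keeping the attribute types small tends to help with shuffle and memory cost.
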

The talk here gives a bit of info on this, including a little code snippet:
https://spark-summit.org/2014/using-spark-and-shark-to-power-a-real-time-recommendation-and-customer-intelligence-platform

Titan also provides Faunus (or I think it is now Gremlin-Hadoop), though that is Hadoop-only at the moment.

On Tue, Jan 26, 2016 at 10:19 PM, Joe Bako <jb...@gracenote.com> wrote:

> I’ve found some references online to various implementations (such as
> Dendrite) leveraging HDFS via TitanDB + HBase for graph processing.
> GraphLab also uses HDFS/Hadoop. I am wondering if (and how) one might use
> TitanDB + Cassandra as the data source for Spark GraphX? The Gremlin
> language seems more targeted towards basic traversals rather than
> analytics, and I’m unsure the performance of attempting to use Gremlin to
> load sub-graphs up into GraphX for analysis. For example, if I have a
> large property graph and wish to run algorithms to find similar sub-graphs
> within, would TitanDB/Gremlin even be a consideration? The underlying data
> model that Titan uses in Cassandra does not seem accessible for direct
> querying via CQL/Thrift.
>
> Any guidance around this nebulous subject is much appreciated!
>
> Joe Bako
> Software Architect
> Gracenote, Inc.
> Mobile: 925.818.2230
> http://www.gracenote.com/