Hello list, I'd like to hear some suggestions from you about my data model. As a baseline application, I'm writing a graphdb API on top of HBase. The idea is to implement Tinkerpop's Blueprints (https://github.com/tinkerpop/blueprints/wiki/) API. I know that building a graphdb over hbase might not be a great idea, so please let's skip this part :). Part of the importance of this model is to be able to traverse the graph efficiently: given a vertex get the list of adjancent vertices connected by an edge of a given type ([1] i.e. Vertex[name=claudio]=>OutgoingEdge[knows]=>Vertex[gender=female]... ). The rest of the API is pretty straightforward, get/set properties to vertices and edges, add edges, add vertices, get neighborhoods of vertices.
Here's the data model: key: vertexid (uuid) Family "Properties:": <property name>=><property value>, ... Family "OutgoingEdges:": <edge key>=><other vertexid>, ... Family "IncomingEdges:": same as outgoing edges... This table allows me to fetch quickly the properties of a vertex and it's adjacency list. I can't use the vertexid as the other endpoint because multiple edges (with different types) can connect the same two vertices. key: edge key (composite(<source vertexid>, <destination vertexid>, <edge typename>)) (i.e. vertexid1_vertexid2_knows) Family "Properties:": <property name>=><property value>, ... This table allows me to fetch quickly the properties of an edge. key: composite(<source vertexid>, "out|in", <edge typename>) (i.e. vertexid1_knows) Family "Neighbor:": <destination vertexid>=>null, This table allows me to search/scan for edges that are either incoming or outgoing from a vertex and belong to specific type and would be the core of the traversing ability of the API (so i want it to be as fast as possible both in terms of network I/O (RPCs), disk I/O (seek)). It should also "scale" on the size of the graph, meaning that with the growth of the graph the cost of this type of operation should depend on the number of edges outgoing from the vertex and not on the total number of vertices and edges. The example above [1] i'd considering vertexid1 the source vertex with property name:claudio i'd scan vertexid1_knows and receive the list of vertices connected. After that i can scan on the column "Properties:gender" on these vertices and look for those having the "female" value. Questions: 1) General: do you see a better data model for my operations? 2) Can i fit everything in one table where for certain keys some families would be empty (i.e. the "OutgoingEdges:" family would not make sense for the edges)? I'd like that because as you can see all the keys are composed by the vertexid uuid prefix, so they would be very compact and fit mostly on the same regionserver. 3) I guess that for the scanning I'd make extensive use of Filters. I guess regexp Filter will be my friend. Do you have concerns about performance of filters applied to this data model? Thank you very much Claudio -- Claudio Martella Digital Technologies Unit Research & Development - Analyst TIS innovation park Via Siemens 19 | Siemensstr. 19 39100 Bolzano | 39100 Bozen Tel. +39 0471 068 123 Fax +39 0471 068 129 [email protected] http://www.tis.bz.it
