Hello list,

I'd like to hear some suggestions from you about my data model. As a
baseline application, I'm writing a graphdb API on top of HBase. The
idea is to implement Tinkerpop's Blueprints
(https://github.com/tinkerpop/blueprints/wiki/) API. I know that
building a graphdb over hbase might not be a great idea, so please let's
skip this part :). Part of the importance of this model is to be able to
traverse the graph efficiently: given a vertex get the list of adjancent
vertices connected by an edge of a given type ([1] i.e.
Vertex[name=claudio]=>OutgoingEdge[knows]=>Vertex[gender=female]... ).
The rest of the API is pretty straightforward, get/set properties to
vertices and edges, add edges, add vertices, get neighborhoods of vertices.

Here's the data model:

key: vertexid (uuid)
Family "Properties:": <property name>=><property value>, ...
Family "OutgoingEdges:": <edge key>=><other vertexid>, ...
Family "IncomingEdges:": same as outgoing edges...

This table allows me to fetch quickly the properties of a vertex and
it's adjacency list. I can't use the vertexid as the other endpoint
because multiple edges (with different types) can connect the same two
vertices.

key: edge key (composite(<source vertexid>, <destination vertexid>,
<edge typename>)) (i.e. vertexid1_vertexid2_knows)
Family "Properties:": <property name>=><property value>, ...

This table allows me to fetch quickly the properties of an edge.

key: composite(<source vertexid>, "out|in", <edge typename>) (i.e.
vertexid1_knows)
Family "Neighbor:": <destination vertexid>=>null,

This table allows me to search/scan for edges that are either incoming
or outgoing from a vertex and belong to specific type and would be the
core of the traversing ability of the API (so i want it to be as fast as
possible both in terms of network I/O (RPCs), disk I/O (seek)). It
should also "scale" on the size of the graph, meaning that with the
growth of the graph the cost of this type of operation should depend on
the number of edges outgoing from the vertex and not on the total number
of vertices and edges.
The example above [1] i'd considering vertexid1 the source vertex with
property name:claudio i'd scan vertexid1_knows and receive the list of
vertices connected. After that i can scan on the column
"Properties:gender" on these vertices and look for those having the
"female" value.

Questions:

1) General: do you see a better data model for my operations?
2) Can i fit everything in one table where for certain keys some
families would be empty (i.e. the "OutgoingEdges:" family would not make
sense for the edges)? I'd like that because as you can see all the keys
are composed by the vertexid uuid prefix, so they would be very compact
and fit mostly on the same regionserver.
3) I guess that for the scanning I'd make extensive use of Filters. I
guess regexp Filter will be my friend. Do you have concerns about
performance of filters applied to this data model?
 

Thank you very much


Claudio

-- 
Claudio Martella
Digital Technologies
Unit Research & Development - Analyst

TIS innovation park
Via Siemens 19 | Siemensstr. 19
39100 Bolzano | 39100 Bozen
Tel. +39 0471 068 123
Fax  +39 0471 068 129
[email protected] http://www.tis.bz.it

Reply via email to