It's been 13 years since I left Chemistry, but I think I have some residual
interest in the subject :-)

My two cents worth for this problem is that it is possible to model
everything in one single graph:

   - Store both the chemical structures and the relationships as graphs,
   differentiating using relationship types.
   - Atoms connected together would have BOND relationships, but the
   molecule itself can also be represented by a single node, with an ATOMS
   relationship into the sub-graph representing the molecular structure.
   - Meta-data about the molecules would form another part of the same
   graph, connected to the root node. So, for example, a molecule made by
   John Doe and stored in Room 123 would be modeled by having a 'rooms'
   node connected to a node for each room, and the room node with
   'name'='123' would be connected by a STORES relationship to the molecule
   node above. Similarly we would have a 'chemists' node connected to each
   chemist, and the chemist with 'name'='John Doe' would be related by
   CREATED to the molecule. The CREATED relationship could have properties
   like created_on, etc. John Doe could also have properties like email,
   phone number, etc.
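
A minimal sketch of the model described in the list above, written as a tiny
in-memory graph in plain Python rather than against the Neo4j API. The Graph
class, the CHEMISTS/ROOMS/CHEMIST/ROOM relationship names, the example email
address and the choice to point ATOMS at every atom are all assumptions made
for illustration; only BOND, ATOMS, STORES, CREATED and created_on come from
the description.

```python
from collections import defaultdict


class Graph:
    """Tiny in-memory property graph with typed, directed relationships."""

    def __init__(self):
        self.props = {}               # node id -> property dict
        self.out = defaultdict(list)  # node id -> [(rel_type, target, rel_props)]
        self._next_id = 0

    def node(self, **props):
        node_id = self._next_id
        self._next_id += 1
        self.props[node_id] = props
        return node_id

    def rel(self, src, rel_type, dst, **props):
        self.out[src].append((rel_type, dst, props))

    def targets(self, src, rel_type):
        """Follow outgoing relationships of one type from src."""
        return [dst for t, dst, _ in self.out[src] if t == rel_type]


g = Graph()
root = g.node(name="root")

# Category nodes hanging off the root (relationship names are hypothetical).
chemists = g.node(name="chemists")
rooms = g.node(name="rooms")
g.rel(root, "CHEMISTS", chemists)
g.rel(root, "ROOMS", rooms)

john = g.node(name="John Doe", email="john@example.org")
room123 = g.node(name="123")
g.rel(chemists, "CHEMIST", john)
g.rel(rooms, "ROOM", room123)

# One molecule node plus its structure sub-graph: water, H-O-H.
water = g.node(name="water")
o = g.node(element="O")
h1 = g.node(element="H")
h2 = g.node(element="H")
for atom in (o, h1, h2):
    g.rel(water, "ATOMS", atom)  # assumption: ATOMS points at every atom
g.rel(o, "BOND", h1, order=1)
g.rel(o, "BOND", h2, order=1)

# Meta-data relationships attach to the molecule node, not to the atoms.
g.rel(john, "CREATED", water, created_on="2010-03-15")
g.rel(room123, "STORES", water)


def molecules_created_by(graph, chemists_node, name):
    """Traverse chemists -> chemist (matched by name) -> CREATED molecules."""
    found = []
    for chemist in graph.targets(chemists_node, "CHEMIST"):
        if graph.props[chemist].get("name") == name:
            found.extend(graph.targets(chemist, "CREATED"))
    return found


made_by_john = molecules_created_by(g, chemists, "John Doe")
```

Here made_by_john holds the single molecule node, and the atoms of each hit
can be reached from it through ATOMS for any further structure matching.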

In this approach we have no need for the external index, since all queries
suggested can be achieved using a traversal. If you want composite queries
and know in advance the main queries you will make, you can also optimise
the graph structure for those queries. For example, if a standard query is
to ask the database which molecules were created by chemist X between March
and August 2010, then create month nodes between the chemist node and the
molecule node, so all molecules made by X in January would be connected
first to X's January node and from there to X himself. This is in effect
building a custom index into the graph. It is a good solution if you know
very well what kind of queries you will make.
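
The month-node idea can be sketched as follows. This is plain Python, not
Neo4j code: each (year, month) bucket stands in for a month node sitting
between a chemist and that chemist's molecules, and the function names and
sample molecules are made up for the example.

```python
from collections import defaultdict

# chemist -> (year, month) -> molecule names. Each inner key plays the role
# of one month node between the chemist node and the molecule nodes.
month_nodes = defaultdict(lambda: defaultdict(list))


def record_creation(chemist, molecule, year, month):
    """Attach a molecule to the chemist's node for the given month."""
    month_nodes[chemist][(year, month)].append(molecule)


def created_between(chemist, start, end):
    """Molecules by chemist with start <= (year, month) <= end, inclusive."""
    buckets = month_nodes.get(chemist, {})
    hits = []
    for key in sorted(buckets):
        if start <= key <= end:
            hits.extend(buckets[key])
    return hits


record_creation("X", "ethanol", 2010, 1)
record_creation("X", "benzene", 2010, 3)
record_creation("X", "toluene", 2010, 8)
record_creation("X", "phenol", 2010, 9)
record_creation("Y", "acetone", 2010, 5)

march_to_august = created_between("X", (2010, 3), (2010, 8))
print(march_to_august)  # ['benzene', 'toluene']
```

The query only visits X's month buckets, never the molecules of other
chemists or of months outside the range, which is the point of building the
index into the graph.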

However, as Peter suggests, using the Lucene index, especially with the new
composite query support, you do not need to think too hard about building
your own index graph. You would simply add both the chemist's name and the
date to a single index on the molecule node itself. The Lucene query for the
chemist and date should then return a set of molecule nodes, and you can do
further pattern matching, if needed, on those.
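
To make the shape of such a query concrete, here is a toy stand-in for a
composite index. It is deliberately not the Lucene or Neo4j index API: the
ToyIndex class, its exact-match semantics and the field names chemist and
created are assumptions for illustration only.

```python
from collections import defaultdict


class ToyIndex:
    """Exact-match inverted index: (field, value) -> set of node ids."""

    def __init__(self):
        self.entries = defaultdict(set)

    def add(self, node_id, **fields):
        """Index one node under several fields at once."""
        for field, value in fields.items():
            self.entries[(field, value)].add(node_id)

    def query(self, **fields):
        """AND together all field=value terms and return matching node ids."""
        sets = [self.entries.get((f, v), set()) for f, v in fields.items()]
        return set.intersection(*sets) if sets else set()


molecules = ToyIndex()
molecules.add(1, chemist="John Doe", created="2010-03")
molecules.add(2, chemist="John Doe", created="2010-09")
molecules.add(3, chemist="Jane Roe", created="2010-03")

# Composite lookup: both terms must match on the same molecule node.
candidates = molecules.query(chemist="John Doe", created="2010-03")
```

The candidates set is then the starting point for any structural pattern
matching, so only molecules that already satisfy the meta-data terms are
traversed.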

One other idea I would consider for pattern matching is to generate a
signature, a kind of hash of the molecule shape that is representative of
the shape. Then you can index that hash also, and effectively get the
molecular shape to be a lucene searchable field. This is only possible if
you know your domain well enough to create a hash that makes sense for your
situation. In the case of chemistry, it really depends on what you mean by
'shape' when doing the search. For example, perhaps a search on chemical
formula is a good enough description of the shape, and in that case your
'hash' is simply the formula. So, for example, ethanol would be C2H5OH.
Searches on that hash should yield ethanol and perhaps a few similar
compounds. If we spent a little more time thinking about this, we could
possibly come up with a few better hashes, more likely to match 'shape', but
I hope you at least got my idea :-)
(I suspect that there are probably standard ways of writing down a chemical
shape uniquely, and if you get the shape hash to be truly unique, you need
not bother to store the molecule as a sub-graph at all, saving space
and complexity).
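
As a rough sketch of the simplest such hash, the function below turns a list
of element symbols into a molecular formula in Hill order (carbon first, then
hydrogen, then the rest alphabetically). The function name and the input
format are assumptions. Note that it uses plain element counts, so ethanol
comes out as C2H6O rather than the condensed C2H5OH, and isomers collide.

```python
from collections import Counter


def formula_signature(atoms):
    """Return a molecular formula string in Hill order for a list of symbols."""
    counts = Counter(atoms)
    if "C" in counts:
        # Carbon first, hydrogen second, everything else alphabetically.
        rest = sorted(e for e in counts if e not in ("C", "H"))
        order = ["C"] + (["H"] if "H" in counts else []) + rest
    else:
        # No carbon: strictly alphabetical, hydrogen included.
        order = sorted(counts)
    return "".join(e + (str(counts[e]) if counts[e] > 1 else "") for e in order)


ethanol = ["C", "C", "O"] + ["H"] * 6         # CH3-CH2-OH
dimethyl_ether = ["C", "O", "C"] + ["H"] * 6  # CH3-O-CH3
methanol = ["C", "O"] + ["H"] * 4             # CH3-OH

print(formula_signature(ethanol))         # C2H6O
print(formula_signature(dimethyl_ether))  # C2H6O
print(formula_signature(methanol))        # CH4O
```

Ethanol and dimethyl ether share a signature, so an index lookup on the
formula narrows the candidates but still needs a structural check on the
sub-graph to tell them apart.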

Regards, Craig

On Tue, Nov 9, 2010 at 2:32 PM, Peter Neubauer <
peter.neuba...@neotechnology.com> wrote:

> Thomas,
> IMHO, the examination of the graphs should be much helped by the new
> Index API, where you can ask and store composite indexes. I would
> imagine that you could do a lot of the exclusion work by indexing the
> chemical structures by not only one node, but possibly construct a
> typical path of nodes and relationships and index that one with
> http://wiki.neo4j.org/content/Index_Framework#Compound_queries, so
> that you can ask complex queries involving the whole structure, and
> get the "entry node" for the subgraph back. Also, that entry node
> could be used to connect to e.g. John Doe in order to represent the
> whole compound.
>
> Would that be feasible?
>
> Cheers,
>
> /peter neubauer
>
> GTalk:      neubauer.peter
> Skype       peter.neubauer
> Phone       +46 704 106975
> LinkedIn   http://www.linkedin.com/in/neubauer
> Twitter      http://twitter.com/peterneubauer
>
> http://www.neo4j.org               - Your high performance graph database.
> http://www.thoughtmade.com - Scandinavia's coolest Bring-a-Thing party.
>
>
>
> On Tue, Nov 9, 2010 at 10:47 AM, Thomas Strunz <beginn...@hotmail.de>
> wrote:
> >
> > Hi all,
> >
> > I have the following questions:
> >
> > Is neo4j also suited for a database that contains many 100k of small
> > graphs (5-30 nodes, mostly around 1-4 relationships per node)? (As far as
> > I understood, this is not the main purpose of the product, but it doesn't
> > hurt to ask.)
> >
> > If yes, how can you perform subgraph matching, and what is its performance
> > (especially considering that most nodes are the same, and the relationship
> > types between them too)?
> > To be specific: graph = chemical structure (mainly C and H atoms (nodes)
> > connected by bonds (single, double, ...)).
> >
> > A query typically only contains nodes and relationships that appear in
> > 100% of the "small graphs" and multiple times per graph.
> > I read
> >
> > http://lists.neo4j.org/pipermail/user/2009-June/001331.html
> >
> > and this seems to hint it will be rather tricky to achieve this (defining
> > the entry point, and only entering each "small graph" once)?
> >
> > Note that filtering steps unrelated to graphs must be done beforehand
> > anyway, and hence the number of "small graphs" to traverse is usually
> > much lower than the total number.
> >
> >
> > And an additional question:
> >
> > Can a node be a traversable graph too?
> > Example: chemical structure XYZ (a graph) was made by John Doe and is
> > stored in Room 123.
> > (The chemical structure XYZ must be seen as a single object (= node) for
> > the additional context.)
> > Query would be: find all chemical structures made by John Doe that match
> > a given chemical structure.
> >
> > I hope it's understandable what I'm trying to get at.
> >
> > Best Regards,
> >
> > Thomas
> >
> >
> > _______________________________________________
> > Neo4j mailing list
> > User@lists.neo4j.org
> > https://lists.neo4j.org/mailman/listinfo/user
> >
