Re: [Neo] Noob questions/comments

Craig Taverner Thu, 10 Dec 2009 07:51:17 -0800

I'd like to comment on the question of indexing event logs. We are also
dealing with event logs and we query them based on time windows as well as
other properties. In our case they are also located on a map, and
categorized using event types. So we have three types of indexes in use:


   - time index (using long values, similar to the TimelineIndex Johan
   mentioned, but we coded one ourselves)
   - spatial index (using two double values, based on the same mechanism as
   the time index, but in 2D)
   - category index (we just create a list of category nodes and link the
   events to the relevant category)

All of these indexes are simply nodes that the event stream nodes link to.
For numerical indexes like the time and spatial indexes we use tree
structures (not B-trees, but usually with multiple children per parent,
suitable for uneven density data). The first level of the tree (closest to the data)
is chosen with a resolution close to common queries (in our case the events
occur many times per second, but queries are usually at second resolution,
so the first index is of second resolution).
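
To make the layering concrete, here is a minimal sketch in plain Python
(not the Neo API, and not our actual code). The class name TimeIndex, the
millisecond timestamps, the 1000 ms base resolution and the 10X step are all
assumptions for illustration. Only buckets that contain data are created,
which is what keeps the tree small for uneven density data.

```python
from collections import defaultdict


class TimeIndex:
    """Hypothetical multi-level bucket index over integer timestamps (ms)."""

    def __init__(self, base_resolution=1000, step=10, levels=3):
        self.base = base_resolution   # width of a level-0 bucket (1 second)
        self.step = step              # each level is `step` times coarser
        self.levels = levels
        # children[0][key] -> list of (timestamp, event)
        # children[n][key] -> list of child bucket keys at level n-1
        self.children = [defaultdict(list) for _ in range(levels)]

    def _width(self, level):
        return self.base * self.step ** level

    def add(self, timestamp, event):
        key = timestamp // self.base
        self.children[0][key].append((timestamp, event))
        # Link the bucket into every coarser level, creating parents lazily.
        for level in range(1, self.levels):
            parent = key // self.step
            if key not in self.children[level][parent]:
                self.children[level][parent].append(key)
            key = parent

    def query(self, start, end):
        """Return (timestamp, event) pairs with start <= timestamp < end."""
        top = self.levels - 1
        found = []
        for key in sorted(self.children[top]):
            self._search(top, key, start, end, found)
        found.sort(key=lambda pair: pair[0])
        return found

    def _search(self, level, key, start, end, found):
        lo = key * self._width(level)
        hi = lo + self._width(level)
        if hi <= start or lo >= end:
            return  # bucket lies entirely outside the window: skip subtree
        if level == 0:
            found.extend((t, e) for t, e in self.children[0][key]
                         if start <= t < end)
        else:
            for child in sorted(self.children[level][key]):
                self._search(level - 1, child, start, end, found)


index = TimeIndex()
for ts, name in [(1500, "a"), (1700, "b"), (25000, "c"), (999999, "d")]:
    index.add(ts, name)
window = index.query(1600, 26000)
```

A query that lines up with the second-resolution buckets only has to filter
events in the buckets at the edges of the window; everything else is either
skipped or taken whole.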

You had a very important question about combined indexes, for example
querying on timestamp and category in the same query with high performance.
Currently we have no need for that in our system, but we have
brainstormed a nice solution to this, so I thought I'd mention it here in
case it is useful. There are two options:

   - If one of the criteria is very limiting all the time, for example
   querying on time-window always returns a small set, then query that first
   and do a slow search for the linked categories. This adds no additional
   complexity to the database, but makes assumptions about the queries, and
   only performs well if these assumptions are true.
   - Otherwise you can build a combined index by connecting the tree nodes
   of one index to the tree nodes of another. In the case of time and category
   indices, each of the nodes in the B-tree or multi-tree time index would be
   connected to all the categories to which its underlying data nodes belong.
   Then when traversing the time tree, you can test for both the time-window
   constraints and the category constraints, and exit the search if either
   fails. We have considered the possibility of building these structures on
   demand, based on actual queries, so the first query that works with any two
   constraints would search on one, and then build the combined index for both.
   This allows subsequent searches to run very fast, without needing to build
   all possible combinations of combined indexes (assuming many single property
   indices exist).
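
Both options can be sketched in a few lines of plain Python. This is only an
illustration of the idea, not something we have built: the IndexNode class,
the insert and search helpers, and the bucket widths of 60 and 1 are
hypothetical. Each time-tree node remembers the categories found beneath it,
so a search can abandon a subtree as soon as either constraint fails.

```python
class IndexNode:
    """Hypothetical time-tree node that also records categories beneath it."""

    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi   # half-open time range [lo, hi)
        self.children = {}          # child range start -> IndexNode
        self.events = []            # (timestamp, category, payload), leaves only
        self.categories = set()     # the "links" to category nodes


def insert(root, widths, timestamp, category, payload):
    """Add an event; widths go from coarse to fine, e.g. [60, 1]."""
    node = root
    node.categories.add(category)
    for width in widths:
        lo = timestamp - timestamp % width
        node = node.children.setdefault(lo, IndexNode(lo, lo + width))
        node.categories.add(category)
    node.events.append((timestamp, category, payload))


def search(node, start, end, category=None):
    """Events in [start, end); category=None means time constraint only."""
    if node.hi <= start or node.lo >= end:
        return []   # time-window constraint fails for the whole subtree
    if category is not None and category not in node.categories:
        return []   # category constraint fails for the whole subtree
    hits = [e for e in node.events
            if start <= e[0] < end and (category is None or e[1] == category)]
    for lo in sorted(node.children):
        hits.extend(search(node.children[lo], start, end, category))
    return hits


root = IndexNode(0, float("inf"))
widths = [60, 1]   # assumed levels: minute buckets, then second buckets
insert(root, widths, 5, "error", "e1")
insert(root, widths, 7, "info", "i1")
insert(root, widths, 65, "info", "i2")
insert(root, widths, 130, "error", "e2")

# Option 1: query on time alone, then do the slow filter on category.
filtered = [e for e in search(root, 0, 120) if e[1] == "error"]
# Option 2: test both constraints while walking the tree.
combined = search(root, 0, 120, "error")
```

Both calls return the same events; the difference is that the second never
descends into the minute bucket that holds only "info" events.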

> -          One aspect of our application will store nodes that can be
> considered similar to event logs.  There may be many thousands of these
> nodes per "event stream".  We would like to be able to traverse the entries
> in chronological order, very quickly.  We were considering the following
> design possibilities:
>
> o   Simply create a node for each "stream" and a node for each entry, with a
> relationship between the stream and the entry, then implement our own sort
> routine
>

Our approach is to create a node for each entry, and index using time and
spatial indices. The first level of index is another stream of data, ordered
by the relevant property, and traversable in that order (e.g. time order).

> o   Similar to the above, but create a node for each "day", and manage
> relationships to allow traversal by stream and/or day
>

In our approach, each level in the index tree represents a coarser level of
granularity. We go up in fixed steps (multiples). A binary tree steps 2X. We
tend to step 10X, because that gives isosceles pyramid trees. But you might
prefer to step in known temporal quantities: seconds, minutes, hours, days,
weeks, months, etc. That will improve search performance if your common
queries are exact multiples of the different index levels.
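
A small Python sketch shows why matching the levels to the queries helps.
The covering_buckets function and both lists of level widths (in seconds)
are assumptions made up for this example. It greedily splits a query window
into aligned index buckets, always taking the largest bucket that fits.

```python
def covering_buckets(start, end, widths):
    """Greedily split [start, end) into aligned buckets of the given widths."""
    widths = sorted(widths, reverse=True)
    buckets = []
    pos = start
    while pos < end:
        for width in widths:
            # A bucket is usable if it starts exactly here and fits the window.
            if pos % width == 0 and pos + width <= end:
                buckets.append((pos, width))
                pos += width
                break
        else:
            raise ValueError("window is finer than the finest index level")
    return buckets


temporal = [1, 60, 3600, 86400]          # second, minute, hour, day
decimal = [1, 10, 100, 1000, 10000]      # stepping 10X

# A window of exactly one hour, aligned on an hour boundary.
hour_temporal = covering_buckets(7200, 10800, temporal)
hour_decimal = covering_buckets(7200, 10800, decimal)
# len(hour_temporal) is 1, len(hour_decimal) is 18
```

With temporal levels the one-hour query is answered by a single index node,
while the 10X levels need 18 nodes to cover the same window.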


> o   Create a node for each stream, a node for each entry and treat the
> entries as a forward-only linked list using relationships between the
> entries (and of course a relationship between the stream and the "first"
> entry)
>

We tend to create relationships for all common query or traversal paths,
with different relationship types in all cases. So traversing the original
data would use 'next', and traversing down the index would use 'child', or
perhaps 'index-child' if there is ambiguity, and so on.
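
As a rough illustration in plain Python (a toy in-memory graph, not the Neo
API; the Graph class and the node names are made up), the same event nodes
can be reached along two differently typed paths:

```python
from collections import defaultdict


class Graph:
    """Toy directed graph whose relationships carry a type name."""

    def __init__(self):
        # source node -> relationship type -> list of target nodes
        self.out = defaultdict(lambda: defaultdict(list))

    def relate(self, source, rel_type, target):
        self.out[source][rel_type].append(target)

    def follow(self, source, rel_type):
        return list(self.out[source][rel_type])


def walk_chain(graph, first, rel_type="next"):
    """Follow a forward-only linked list of events (assumes no cycles)."""
    node = first
    while node is not None:
        yield node
        targets = graph.follow(node, rel_type)
        node = targets[0] if targets else None


def leaves(graph, node, rel_type="child"):
    """Descend the index tree and collect the data nodes at the bottom."""
    children = graph.follow(node, rel_type)
    if not children:
        return [node]
    found = []
    for child in children:
        found.extend(leaves(graph, child, rel_type))
    return found


g = Graph()
# Original data: stream -> first entry, then entry -> next entry.
g.relate("stream", "first", "e1")
g.relate("e1", "next", "e2")
g.relate("e2", "next", "e3")
# Index: root -> buckets -> entries, all typed 'child'.
g.relate("index-root", "child", "bucket-0")
g.relate("index-root", "child", "bucket-1")
g.relate("bucket-0", "child", "e1")
g.relate("bucket-0", "child", "e2")
g.relate("bucket-1", "child", "e3")

in_stream_order = list(walk_chain(g, g.follow("stream", "first")[0]))
via_index = leaves(g, "index-root")
```

Because each traversal names the relationship type it follows, the 'next'
chain and the 'child' tree never interfere with each other even though they
share the same event nodes.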

> -          Anyone used any kind of intermediate index or other approach to
> bridge multiple Neo instances?
>

Hmm... I think it was this question that got me started on the combined
index discussion above, but now that I re-read it, I see it has nothing to
do with combined indices. I've thought a bit about bridging indices, but
have nothing really useful to offer here. Sorry. Hope the long discussion
above still has some value :-(