[Neo] merging databases to enable bulk load on live database

Craig Taverner Thu, 29 Oct 2009 14:51:37 -0700

Hi,

Recently Johan pointed out to me that merging databases was not something
easily added to the core neo4j because there would be clashes of ids.
However, I have thought a bit more about it, and I think there is a way this
can work (in theory) for special cases.


Here is my example:

   - We have a main database that contains most data. Node ids presumably
   increment from 1 upwards as data is added.
   - We wish to bulk load data into this database while it is still active
   (which bulk load does not allow)
   - We instead bulk load into a completely new database, but one where the
   ids are set to start at a high number, well above the current max id of
   the main database
   - Once the bulk load is finished, the new database is appended to the main
   one with no id clashes
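
To make the steps above concrete, here is a toy in-memory model of the idea.
It is only a sketch and not Neo4j code: ToyStore, create, maxId and append are
hypothetical names, and a TreeMap stands in for the real store files.

```java
import java.util.TreeMap;

// Toy model of the idea only; nothing here is Neo4j API.
public class ToyStore {
    private final TreeMap<Long, String> nodes = new TreeMap<>();
    private long nextId;

    // firstId plays the role of the proposed id offset.
    public ToyStore(long firstId) {
        this.nextId = firstId;
    }

    // Allocate the next id and store a payload under it.
    public long create(String payload) {
        long id = nextId++;
        nodes.put(id, payload);
        return id;
    }

    public String get(long id) {
        return nodes.get(id);
    }

    // Highest id in use, or -1 if the store is empty.
    public long maxId() {
        return nodes.isEmpty() ? -1 : nodes.lastKey();
    }

    // Append another store. This succeeds only if every id in 'other'
    // lies at or above the next id this store would hand out.
    public boolean append(ToyStore other) {
        if (!other.nodes.isEmpty() && other.nodes.firstKey() < nextId) {
            return false; // id clash: the load must be repeated with a larger offset
        }
        nodes.putAll(other.nodes);
        // Continue allocating after the appended data.
        nextId = Math.max(nextId, other.nextId);
        return true;
    }
}
```

In this model the main store would be created with new ToyStore(1) and the
temporary one with something like new ToyStore(1000000); append then either
merges the two id ranges or reports a clash.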

Issues I can imagine with this approach:

   - We need to be sure the main database does not grow enough to create an
   id clash during the time of the bulk load. I would think the offset is
   something the application programmer needs to choose based on knowledge of
   the behaviour of the application in this regard. For example, if the only
   way the application adds large numbers of nodes is through the bulk load,
   the programmer knows the main ids will not grow as long as only one bulk
   load runs at a time. Worst case scenario, the merge is rejected and the
   application has to repeat the entire load with a new offset.
   - If ids are actually pure array indexes into the data, then the current
   neo4j code would not support indexing from a high number. I would imagine
   it would be easy to have a database-wide offset to deal with this.
   - If ids are array indexes, after the merge there would be a possibly
   large chunk of unused space in the database (after the end of the main data
   and before the beginning of the new data).
   - This trick is probably necessary for all database files: nodes,
   relationships and properties.
   - After the merge the application code needs all Node instances to still
   be valid, for both databases, but presumably the two NeoService instances
   need to be merged to point to one object. The application code should also
   take care to create appropriate links between old and new data, assuming
   that is needed by the application (it is in my case).
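
The first and third issues come down to simple arithmetic on ids, which can
be sketched as below. OffsetPlanner and its methods are hypothetical helpers,
not anything in Neo4j, and the headroom value is an application-level guess,
as described above.

```java
// Hypothetical helper, not part of Neo4j: plain arithmetic on ids.
public class OffsetPlanner {

    // First id for the temporary database: the current max id plus a headroom
    // the application assumes the main database will not exceed during the load.
    public static long chooseOffset(long currentMaxId, long headroom) {
        return currentMaxId + headroom + 1;
    }

    // True if the main database has grown into the bulk-loaded id range,
    // in which case the merge would have to be rejected.
    public static boolean clashes(long mainMaxIdAtMerge, long offset) {
        return mainMaxIdAtMerge >= offset;
    }

    // Number of ids left unused between the main data and the appended data.
    public static long unusedIds(long mainMaxIdAtMerge, long offset) {
        return Math.max(0, offset - mainMaxIdAtMerge - 1);
    }
}
```

For example, with a max id of 1000 and a headroom of 5000 the offset is 6001;
if the main database has grown to id 1200 by merge time, there is no clash
and 4800 ids are left unused in the gap.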

What do people think of this approach? If it is possible, it would certainly
solve my problem with needing to run bulk loads on a live database. I
personally do not think the additional API components would be too complex.
In fact all that is needed are three API additions:

   - Ability to get the current max node id from the main database (perhaps
   already exists?)
   - Ability to set the offset for the ids in the EmbeddedNeo() constructor
   - Method on NeoService (or probably only on EmbeddedNeo) for merging in
   another database, for example, mainNeo.append(tempNeo). This would
   presumably return a boolean, or throw exceptions. I expect the tempNeo
   instance becomes invalid after this call (or points to the same database as
   mainNeo). Node instances on either database need to remain valid (but point
   to the main database), so we can add relations immediately to link the
   datasets correctly.
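
As a rough illustration of how these three additions might fit together, here
is a self-contained sketch. Every type and method in it (Db, NodeRef,
getMaxNodeId, the offset constructor, append) is a hypothetical stand-in for
the proposal and does not exist in Neo4j. It focuses on the last point: node
handles from either database stay usable after the merge.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical shapes only: none of these types or methods exist in Neo4j.
public class MergeApiSketch {

    // Stand-in for a database instance that holds nodes by id.
    public static class Db {
        private Map<Long, String> nodes = new HashMap<>();
        private final long firstId;
        private long nextId;
        private Db mergedInto; // set once this database has been appended to another

        // Proposed addition: id offset passed to the constructor.
        public Db(long idOffset) {
            this.firstId = idOffset;
            this.nextId = idOffset;
        }

        // After a merge, all operations are redirected to the main database.
        private Db target() {
            return mergedInto == null ? this : mergedInto.target();
        }

        // Proposed addition: highest id handed out so far.
        public long getMaxNodeId() {
            return target().nextId - 1;
        }

        public NodeRef createNode(String name) {
            Db t = target();
            long id = t.nextId++;
            t.nodes.put(id, name);
            return new NodeRef(this, id);
        }

        // Proposed addition: mainNeo.append(tempNeo). Throws on an id clash.
        public void append(Db temp) {
            if (temp == this || temp.mergedInto != null) {
                throw new IllegalArgumentException("database cannot be appended");
            }
            if (!temp.nodes.isEmpty() && temp.firstId < nextId) {
                throw new IllegalStateException("id clash, reload with a larger offset");
            }
            nodes.putAll(temp.nodes);
            nextId = Math.max(nextId, temp.nextId);
            temp.nodes = new HashMap<>(); // temp no longer owns any data
            temp.mergedInto = this;       // temp now points at the main database
        }
    }

    // Stand-in for a Node instance: remembers its database and id.
    public static class NodeRef {
        private final Db db;
        public final long id;

        NodeRef(Db db, long id) {
            this.db = db;
            this.id = id;
        }

        // Resolved through target(), so it keeps working after a merge.
        public String name() {
            return db.target().nodes.get(id);
        }
    }
}
```

The application would create tempNeo with an offset derived from
mainNeo.getMaxNodeId(), bulk load into it, call mainNeo.append(tempNeo), and
then keep using the node handles it already holds to link old and new data.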

Cheers, Craig
