On Tue, Jun 16, 2009 at 11:21 AM, Martin Kleppmann wrote:
> Hello all,
>
> I'm planning to add Lucene indexing support to the Scala/REST wrapper
> I announced yesterday. This means venturing into areas where the
> documentation is patchy... I'm looking through Andreas' Ruby library
> as a starting point on how to do things, but I have a few additional
> questions:
>

Hi Martin,

As you say, the documentation outside the Neo4j core is a bit patchy.
Going forward, improving the documentation of the other Neo4j components
is becoming increasingly important.

> - I would like to be able to track modifications to nodes and
> relationships and automatically submit these to the indexer, so that
> code outside doesn't have to worry about indexing. What I'm planning
> to do is to have my own classes implementing the NeoService, Node and
> Relationship interfaces, each delegating to an underlying service/node/
> relationship but tracking modifications and submitting them to an
> IndexService on transaction commit. Can you see anything wrong with
> this approach?

Nothing wrong with this approach, but since you have written a Scala
wrapper on top of Neo4j, shouldn't there be places in it where you could
hook in and do exactly this?

> If I implement all of NeoService's getters to return my
> wrapped node/relationship implementations, can I be sure that query
> code (e.g. using traversers) will always return my wrapped
> implementations? (Asked the other way round, is it possible to
> intercept every occurrence of org.neo4j.impl.core.NodeImpl and
> RelationshipImpl being instantiated?)
>

On the Neo4j API level you would have to make sure that the following
places return your wrapped implementations:

 o NeoService getters (getNodeById, getRelationshipById, getAllNodes)
 o Node getters of relationships and traverse methods
 o Relationship getters of nodes
 o TraversalPosition used in stop and returnable evaluators for
   traversers

Neo4j never returns the actual NodeImpl or RelationshipImpl instances to
the user; instead we return NodeProxy and RelationshipProxy objects that
just hold the node or relationship id.

> - What is the thread safety of LuceneIndexService and friends? Would
> it be right to have (a) a single instance and synchronise all threads
> on it; (b) a single instance with multi-threaded access; (c) one
> instance per thread?
>

LuceneIndexService is thread safe. A single instance of IndexService
(which should be tied to one NeoService) can handle concurrent
transactions from different threads.

> - Am I right in my understanding that the Lucene index is disk-backed,
> and thus it should not be necessary to re-index the db after
> restarting the server? Do you find that it is still necessary to do a
> full rebuild of the index occasionally in case it goes out of sync? (I
> guess that when using the more loose isolation modes like
> ASYNC_OTHER_TX you might get lost updates to the indexer on abrupt
> server shut-down.)
>

Yes, the Lucene index is disk-backed, and we added a transaction log so
that it can be exposed as an XA resource. This means that when running
with the default isolation level, a transaction performing write
operations on both the Lucene index and the Neo4j graph will execute
using the two-phase commit protocol (a re-index is never needed). If you
use the ASYNC_OTHER_TX isolation level, there are crash scenarios that
could render the index inconsistent (since the updates won't execute in
a 2PC transaction), and then you would have to rebuild the index.

> - What is NeoIndexService (used by the IMDB demo) about? From skimming
> the code it looks like it is represented completely in a Neo4j
> subgraph (arranged in a BTree?). What advantages does this approach
> offer over Lucene? I assume it won't support any of Lucene's more
> advanced features, such as fuzzy matching. Similarly, how do
> SortedTree/Timeline compare to a sorted Lucene index and range queries?
>

NeoIndexService is, as you say, represented completely in Neo4j as a
subgraph arranged in a BTree. There is no real functional advantage to
using NeoIndexService over Lucene. Its only advantages are that you
don't take on a new dependency (Lucene) and that there is no need for
2PC (since only one resource is involved). Despite that, we still
recommend LuceneIndexService for production systems.

If you look at the IndexService interface, it only provides the means to
do an exact lookup of nodes given a key and a value. The purpose of
IndexService is not to provide a generic indexing mechanism that can do
things like fuzzy matching, full-text search or regular expression
search.

Timeline was written (originally for a specific project) to handle large
ordered lists that are mostly appended to at the end and that need fast
iteration over specific sections of the list. Similarly, SortedTree was
developed for a project where we needed random inserts but ordered full
iteration. I am sure there are other tools and solutions, such as
Lucene, that handle these problems well.

Finally, I would just like to mention that the graph (or more
specifically the typed relationships) is the primary index. Our
experience is that a well-designed graph for your domain, together with
a simple "exact lookup" index, will solve most of your use cases.

Regards,
-Johan
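
For illustration only, here is a minimal Java sketch of the two ideas
discussed in this thread: a delegating wrapper that records property
changes and submits them to an index on commit, and a simple "exact
lookup" index keyed on (key, value). Every name in it (SimpleNode,
InMemoryNode, ExactLookupIndex, TrackingSession) is hypothetical and
stands in for the real NeoService/Node/IndexService types; it does not
use or reproduce the Neo4j API, and it ignores relationships, traversers
and thread safety.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a node; NOT the Neo4j Node interface.
interface SimpleNode {
    long getId();
    Object getProperty(String key);
    void setProperty(String key, Object value);
}

// Plain in-memory node that plays the role of the underlying delegate.
class InMemoryNode implements SimpleNode {
    private final long id;
    private final Map<String, Object> props = new HashMap<>();

    InMemoryNode(long id) { this.id = id; }

    public long getId() { return id; }
    public Object getProperty(String key) { return props.get(key); }
    public void setProperty(String key, Object value) { props.put(key, value); }
}

// Exact-lookup index: maps (key, value) to node ids.
// It offers no fuzzy, full-text or range queries.
class ExactLookupIndex {
    private final Map<String, Map<Object, Set<Long>>> entries = new HashMap<>();

    void index(long nodeId, String key, Object value) {
        entries.computeIfAbsent(key, k -> new HashMap<>())
               .computeIfAbsent(value, v -> new HashSet<>())
               .add(nodeId);
    }

    void remove(long nodeId, String key, Object value) {
        Map<Object, Set<Long>> byValue = entries.get(key);
        if (byValue == null) return;
        Set<Long> ids = byValue.get(value);
        if (ids != null) ids.remove(nodeId);
    }

    // Returns a copy so callers cannot mutate the index's internal sets.
    Set<Long> getNodeIds(String key, Object value) {
        Map<Object, Set<Long>> byValue = entries.get(key);
        if (byValue == null) return new HashSet<>();
        return new HashSet<>(byValue.getOrDefault(value, new HashSet<>()));
    }
}

// Hands out wrapped nodes and queues index changes until commit().
class TrackingSession {
    private final ExactLookupIndex index;
    private final List<Runnable> pending = new ArrayList<>();

    TrackingSession(ExactLookupIndex index) { this.index = index; }

    SimpleNode wrap(SimpleNode delegate) { return new TrackedNode(delegate); }

    // Applies the queued index changes in the order they were recorded.
    void commit() {
        pending.forEach(Runnable::run);
        pending.clear();
    }

    // Delegates every call; writes are also recorded for the index.
    private class TrackedNode implements SimpleNode {
        private final SimpleNode delegate;

        TrackedNode(SimpleNode delegate) { this.delegate = delegate; }

        public long getId() { return delegate.getId(); }
        public Object getProperty(String key) { return delegate.getProperty(key); }

        public void setProperty(String key, Object value) {
            Object old = delegate.getProperty(key);
            delegate.setProperty(key, value);
            long id = delegate.getId();
            // Drop the stale entry (if any), then add the new one, on commit.
            if (old != null) pending.add(() -> index.remove(id, key, old));
            pending.add(() -> index.index(id, key, value));
        }
    }
}
```

Calling code only sees SimpleNode, so it never has to touch the index
itself: it calls setProperty on a wrapped node, and the lookup
getNodeIds("name", "Alice") starts returning that node's id once
commit() has run. In a real wrapper the same interception would have to
cover every place listed above that can hand out a node or relationship.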

