Re: [Neo] Neo4j + Lucene

Johan Svensson Tue, 16 Jun 2009 06:47:45 -0700

On Tue, Jun 16, 2009 at 11:21 AM, Martin
Kleppmann<[email protected]> wrote:
> Hello all,
>
> I'm planning to add Lucene indexing support to the Scala/REST wrapper
> I announced yesterday. This means venturing into areas where the
> documentation is patchy... I'm looking through Andreas' Ruby library
> as a starting point on how to do things, but I have a few additional
> questions:
>


Hi Martin,

As you say, the documentation outside the Neo4j core is a bit patchy.
Going forward, improving the documentation of the other Neo4j
components is becoming increasingly important.

> - I would like to be able to track modifications to nodes and
> relationships and automatically submit these to the indexer, so that
> code outside doesn't have to worry about indexing. What I'm planning
> to do is to have my own classes implementing the NeoService, Node and
> Relationship interfaces, each delegating to an underlying service/node/
> relationship but tracking modifications and submitting them to an
> IndexService on transaction commit. Can you see anything wrong with
> this approach?

There is nothing wrong with this approach, but since you have already
written a Scala wrapper on top of Neo4j, aren't there places in it
where you could hook in and do exactly this?

> If I implement all of NeoService's getters to return my
> wrapped node/relationship implementations, can I be sure that query
> code (e.g. using traversers) will always return my wrapped
> implementations? (Asked the other way round, is it possible to
> intercept every occurrence of org.neo4j.impl.core.NodeImpl and
> RelationshipImpl being instantiated?)
>

At the Neo4j API level you would have to make sure that the following
places return your wrapped implementations:

o NeoService getters (getNodeById, getRelationshipById, getAllNodes)
o Node getters of relationships and traverse methods
o Relationship getters of nodes
o TraversalPosition used in stop and returnable evaluators for traverser

Neo4j never returns the actual NodeImpl or RelationshipImpl instances
to the user. Instead we return NodeProxy and RelationshipProxy
instances that just hold the node or relationship id.
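
To make the wrapping requirement above concrete, here is a minimal
sketch of the delegation pattern. It does not use the Neo4j API:
SimpleNode, InMemoryNode and TrackingNode are hypothetical stand-ins
invented for this example, and the list of strings stands in for
whatever would be submitted to the index on commit. The point it shows
is that every getter that hands out a node has to re-wrap it, otherwise
callers escape the tracking layer.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a node interface, reduced to four methods.
interface SimpleNode {
    long getId();
    void setProperty(String key, Object value);
    Object getProperty(String key);
    List<SimpleNode> getNeighbours();
}

// Plain in-memory node playing the role of the underlying implementation.
class InMemoryNode implements SimpleNode {
    private final long id;
    private final Map<String, Object> properties = new HashMap<>();
    private final List<SimpleNode> neighbours = new ArrayList<>();

    InMemoryNode(long id) { this.id = id; }

    public long getId() { return id; }
    public void setProperty(String key, Object value) { properties.put(key, value); }
    public Object getProperty(String key) { return properties.get(key); }
    public List<SimpleNode> getNeighbours() { return neighbours; }

    void connectTo(SimpleNode other) { neighbours.add(other); }
}

// Decorator: delegates every call, records writes, and re-wraps every
// node it returns so that callers never see an unwrapped instance.
class TrackingNode implements SimpleNode {
    private final SimpleNode delegate;
    private final List<String> pendingChanges;

    TrackingNode(SimpleNode delegate, List<String> pendingChanges) {
        this.delegate = delegate;
        this.pendingChanges = pendingChanges;
    }

    public long getId() { return delegate.getId(); }

    public void setProperty(String key, Object value) {
        delegate.setProperty(key, value);
        // Remember the change; a real wrapper would flush this on commit.
        pendingChanges.add(delegate.getId() + ":" + key + "=" + value);
    }

    public Object getProperty(String key) { return delegate.getProperty(key); }

    public List<SimpleNode> getNeighbours() {
        List<SimpleNode> wrapped = new ArrayList<>();
        for (SimpleNode neighbour : delegate.getNeighbours()) {
            wrapped.add(new TrackingNode(neighbour, pendingChanges));
        }
        return wrapped;
    }
}
```

A write made through a neighbour obtained from a TrackingNode is still
recorded, because the neighbour is itself a TrackingNode sharing the
same change list.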

> - What is the thread safety of LuceneIndexService and friends? Would
> it be right to have (a) a single instance and synchronise all threads
> on it; (b) a single instance with multi-threaded access; (c) one
> instance per thread?
>

LuceneIndexService is thread safe. A single IndexService instance
(which should be tied to one NeoService) can handle concurrent
transactions from different threads.

> - Am I right in my understanding that the Lucene index is disk-backed,
> and thus it should not be necessary to re-index the db after
> restarting the server? Do you find that it is still necessary to do a
> full rebuild of the index occasionally in case it goes out of sync? (I
> guess that when using the more loose isolation modes like
> ASYNC_OTHER_TX you might get lost updates to the indexer on abrupt
> server shut-down.)
>

Yes, the Lucene index is disk-backed, and we added a transaction log
so that it can be exposed as an XA resource. This means that, when
running with the default isolation level, a transaction that writes to
both the Lucene index and the Neo4j graph commits using the two-phase
commit protocol (so a re-index is never needed).

If you use the ASYNC_OTHER_TX isolation level, there are crash
scenarios that could leave the index inconsistent (since the updates
won't execute in a 2PC transaction), and then you would have to
rebuild the index.
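
For readers unfamiliar with two-phase commit, the following is a
generic sketch of the protocol with two participants, one standing in
for the graph and one for the index. It is not the XA API and not how
Neo4j implements it: Participant, RecordingParticipant and
TwoPhaseCoordinator are hypothetical names made up for illustration.
It only shows why both resources either commit together or not at all.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical participant; real XA resources expose a richer interface.
interface Participant {
    boolean prepare();   // phase 1 vote: true means "ready to commit"
    void commit();       // phase 2, only after every vote was "yes"
    void rollback();     // undo work that was already prepared
}

// Participant that votes as configured and logs every call it receives.
class RecordingParticipant implements Participant {
    private final String name;
    private final boolean vote;
    private final List<String> log;

    RecordingParticipant(String name, boolean vote, List<String> log) {
        this.name = name;
        this.vote = vote;
        this.log = log;
    }

    public boolean prepare() { log.add(name + " prepare"); return vote; }
    public void commit() { log.add(name + " commit"); }
    public void rollback() { log.add(name + " rollback"); }
}

class TwoPhaseCoordinator {
    // Returns true if all participants committed, false if the
    // transaction was rolled back because one of them voted "no".
    static boolean run(List<Participant> participants) {
        List<Participant> prepared = new ArrayList<>();
        for (Participant p : participants) {
            if (p.prepare()) {
                prepared.add(p);
            } else {
                // Assumption: a participant that votes "no" has already
                // discarded its own work, so only the others are undone.
                for (Participant q : prepared) {
                    q.rollback();
                }
                return false;
            }
        }
        for (Participant p : participants) {
            p.commit();
        }
        return true;
    }
}
```

An asynchronous index update sits outside this all-or-nothing loop,
which is why a crash at the wrong moment can leave the two resources
out of step.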

> - What is NeoIndexService (used by the IMDB demo) about? From skimming
> the code it looks like it is represented completely in a Neo4j
> subgraph (arranged in a BTree?). What advantages does this approach
> offer over Lucene? I assume it won't support any of Lucene's more
> advanced features, such as fuzzy matching. Similarly, how do
> SortedTree/Timeline compare to a sorted Lucene index and range queries?
>

NeoIndexService is, as you say, represented completely in Neo4j as a
subgraph arranged in a BTree. There is no real advantage to using
NeoIndexService over Lucene. The only benefits are that you avoid an
extra dependency (Lucene) and don't need 2PC (since there is only one
resource involved), but despite that we still recommend that people
use LuceneIndexService for production systems.

If you look at the IndexService interface, it only provides the means
to do an exact lookup of nodes given a key and a value. The purpose of
IndexService is not to provide a generic indexing mechanism that can
do things like fuzzy matching, full-text or regular expression search.
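
As an illustration of what "exact lookup given a key and a value"
means in practice, here is a small in-memory sketch. ExactLookupIndex
is a hypothetical class written for this example and works on plain
node ids; it is not the IndexService interface and has none of its
transaction handling. It shows that only a value equal to the indexed
one matches, with no prefix or fuzzy matching.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical exact-match index: key -> value -> ids of matching nodes.
class ExactLookupIndex {
    private final Map<String, Map<Object, Set<Long>>> entries = new HashMap<>();

    // Register a node id under the given key/value pair.
    void index(long nodeId, String key, Object value) {
        entries.computeIfAbsent(key, k -> new HashMap<>())
               .computeIfAbsent(value, v -> new LinkedHashSet<>())
               .add(nodeId);
    }

    // Return the ids indexed under exactly this key and value.
    Set<Long> getNodes(String key, Object value) {
        Map<Object, Set<Long>> byValue = entries.get(key);
        if (byValue == null) {
            return Collections.<Long>emptySet();
        }
        Set<Long> ids = byValue.get(value);
        if (ids == null) {
            return Collections.<Long>emptySet();
        }
        return Collections.unmodifiableSet(ids);
    }

    // Remove a single node id from the given key/value pair, if present.
    void removeIndex(long nodeId, String key, Object value) {
        Map<Object, Set<Long>> byValue = entries.get(key);
        if (byValue == null) {
            return;
        }
        Set<Long> ids = byValue.get(value);
        if (ids != null) {
            ids.remove(nodeId);
        }
    }
}
```

Looking up "Matrix" in an index that only contains "The Matrix"
returns nothing, which is the behaviour described above.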

Timeline was written (originally for a specific project) to handle
large ordered lists that are mostly appended to at the end and that
need fast iteration over specific sections of the list. Similarly,
SortedTree was developed for a project where we needed random inserts
but ordered full iteration. I am sure there are other tools and
solutions, such as Lucene, that handle these problems well.
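
The access pattern Timeline targets (mostly appending at the end, then
iterating over one section in order) can be sketched with a sorted map.
TimelineSketch below is a hypothetical in-memory class built on
java.util.TreeMap; the real Timeline stores its entries in the graph,
so this only illustrates the behaviour, not the implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical in-memory timeline: timestamp -> items added at that time.
class TimelineSketch {
    private final TreeMap<Long, List<String>> entries = new TreeMap<>();

    // Add an item; timestamps usually grow, but out-of-order adds work too.
    void add(long timestamp, String item) {
        entries.computeIfAbsent(timestamp, t -> new ArrayList<>()).add(item);
    }

    // Items with from <= timestamp < to, in timestamp order.
    List<String> between(long from, long to) {
        List<String> result = new ArrayList<>();
        for (List<String> bucket : entries.subMap(from, true, to, false).values()) {
            result.addAll(bucket);
        }
        return result;
    }
}
```

Iterating a section only touches the entries inside the requested
range rather than scanning the whole list.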

Finally, I would just like to mention that the graph (or more
specifically the typed relationships) is the primary index. Our
experience is that a well-designed graph for your domain, together
with a simple "exact lookup" index, will solve most of your use cases.

Regards,
-Johan
_______________________________________________
Neo mailing list
[email protected]
https://lists.neo4j.org/mailman/listinfo/user
