Re: [Neo4j] Creating and managing external index
Avi,

we are in the process of getting out a nicer base framework for transactional index creation, and an index provider for Redis. Meanwhile, if you want, you could look into the Berkeley DB index that I tried to cook together (no guarantees there), https://github.com/peterneubauer/bdb-index, and see if that is something to contemplate?

Cheers,

/peter neubauer

GTalk: neubauer.peter
Skype: peter.neubauer
Phone: +46 704 106975
LinkedIn: http://www.linkedin.com/in/neubauer
Twitter: http://twitter.com/peterneubauer

http://www.neo4j.org - NOSQL for the Enterprise.
http://startupbootcamp.org/ - Öresund - Innovation happens HERE.

On Mon, Nov 21, 2011 at 2:40 AM, Avi Shai unicornrainbowche...@gmail.com wrote:

> What is the best way to create an external index but only for certain nodes? [...]

___
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user
Re: [Neo4j] Creating and managing external index
Peter,

A Redis index provider sounds great. Blue Redis (Redis Blueprints) has some ideas in it, but I think they can be improved on and adapted to neo4j. Transactional index creation sounds like what I need.

I have also been thinking about a case like tagging a resource, where several points of interest are involved: the event, the resource, the event source, and the user. I would perhaps want a temporal index updated for each of those, but making the user wait on the index updates is not the behavior I want. Instead, my idea was to insert the core data (the tag event) and then fire something into an index update queue. Most of the time items would be taken off the queue almost immediately, but the user would not be forced to wait until all the indexes were updated after creating or updating a node. I suppose I could also have some sort of retry system in case updating the indexes for anything involved fails.

The tough part is the temporal indexes, and ensuring they are sorted correctly without actually scanning them much or at all. Other kinds of indexes, such as b-tree inserts, are a bit more friendly to this sort of system.

My worry with inserting into an index in a normal transaction is not only the wait, but constantly locking the same nodes if using any in-graph indexes on a hot spot. An example might be if I am constantly updating and indexing a discussion: I don't want replies to have to contend for the same index node. That said, most of the time it is probably a non-issue since the index writes happen so fast anyway.

--
View this message in context: http://neo4j-community-discussions.438527.n3.nabble.com/Creating-and-managing-external-index-tp3523613p3527899.html
Sent from the Neo4j Community Discussions mailing list archive at Nabble.com.
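
To make the queue-plus-retry idea above concrete, here is a minimal sketch, not a definitive implementation. It assumes the core write has already been committed and only the external index update is deferred. The class name `IndexUpdateQueue`, the `apply_update` callback, and the `max_attempts` parameter are all hypothetical names chosen for illustration; a real setup would likely use a worker thread or a message broker instead of the synchronous `drain()` shown here.

```python
from collections import deque


class IndexUpdateQueue:
    """Deferred index updates with a bounded number of retries.

    Hypothetical sketch: apply_update stands in for whatever writes to
    the external index (Redis, Mongo, ...). It should raise on failure.
    """

    def __init__(self, apply_update, max_attempts=3):
        self._apply = apply_update
        self._max_attempts = max_attempts
        self._pending = deque()   # items are (update, attempts_so_far)
        self.failed = []          # updates that exhausted their retries

    def enqueue(self, update):
        """Called right after the core data is written; returns at once."""
        self._pending.append((update, 0))

    def drain(self):
        """Apply every pending update, re-queueing failures at the back."""
        while self._pending:
            update, attempts = self._pending.popleft()
            try:
                self._apply(update)
            except Exception:
                attempts += 1
                if attempts < self._max_attempts:
                    self._pending.append((update, attempts))
                else:
                    self.failed.append(update)
```

Because a retried update goes to the back of the queue, updates can reach the index out of arrival order. One way to keep a temporal index correct under that assumption is to carry the event timestamp inside each update and let the index order by that timestamp rather than by insertion order.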
[Neo4j] Creating and managing external index
What is the best way to create an external index, but only for certain nodes? Really I want something like the in-graph data structures, but stored in another database (or databases). In essence I am indexing only a sub-graph or a straight list of nodes. I then want to use these indexes as entry points in some cases rather than traversing.

I understand that there is already Lucene, but I have data that is better suited to other indexes. I still want to use Lucene for full-text, just not for anything else. I am currently taking a stab at implementing the Blueprints index interfaces (manual, automatic), but for another purpose.

If I am always updating these indexes, but only for certain vertex types, what is the best integration point? In my data service classes/lower-level neo4j stuff, or in a server event handler that plugs into the transaction? What about for all vertices? I guess I understand how to write the index classes, but not the best way of consuming them, or whether they apply well to lots of partial, smaller indexes.

For instance, I want to store data as temporal values, with the most recent data first for a group of nodes. I'm not doing Twitter or a blog, but either is a good enough analogy. If I post something with a given tag, I want to index all the nodes that have been tagged with that tag (tag edge) in temporal order, for example to create a recently tagged feed, or a recently seen users feed containing the users that have recently tagged using that tag. I could store this data in Redis exactly how I want and have a hot set in memory that can then be used either directly in some pages in my app, or as an entry point into neo4j for more complex queries. These indexes probably require lots of writes, and I also want to avoid locking related nodes on any updates.

Currently, part of the reason I'm doing this is that I have lots of super nodes in my design. I've patched this somewhat by keeping counts in node properties and adding proxy nodes as mini-partitions to reduce the number of relationships. I've also looked at things like combining common nodes together as junctions, but there are probably too many permutations for that to scale.

Anyway, if I use in-graph indexes, I have to update my indexes on every insertion or update. I'm going to try out indexed relationships, and I think it will help, but with respect, I don't think it will scale well or fit my use cases, especially for indexes where data drops out because the size is fixed (like a fixed list). I feel that creating index structures in the graph is nice, but it will severely balloon the graph. Moreover, I want to save resources on the servers running neo for graph traversals and other graph activities, and I would rather use other clustered servers to store huge amounts of index data in memory.

One other idea is to use another neo4j instance as an index to itself, but I think the characteristics of what I am doing are better suited in some cases to Redis (temporal lists) or Mongo (hierarchical metrics), depending on the use case. Example: pulling down linear lists of time-data by page and sorting front to back or back to front.

I know that's a lot, but I wanted to at least give some detail beyond what I've already read here in all the old posts I've dug through this week. Any feedback?

Thanks.

--
View this message in context: http://neo4j-community-discussions.438527.n3.nabble.com/Creating-and-managing-external-index-tp3523613p3523613.html
Sent from the Neo4j Community Discussions mailing list archive at Nabble.com.
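
As a rough illustration of the fixed-size, most-recent-first list described above (the "recently tagged" feed, paged front to back or back to front), here is a small in-memory sketch. It is only a stand-in: in Redis the same shape is commonly built with LPUSH followed by LTRIM on a list per tag. The class name `RecentIndex`, the key format `"tag:neo4j"`, and the choice to keep each node at most once per list are assumptions made for the example, not anything from neo4j or an existing index provider.

```python
from collections import defaultdict, deque


class RecentIndex:
    """Capped, newest-first list of node ids per key (hypothetical sketch).

    The stored values are just node ids, so an entry can be used as an
    entry point into the graph for a more complex query afterwards.
    """

    def __init__(self, max_size=1000):
        self.max_size = max_size
        # With maxlen set, appendleft() on a full deque silently drops
        # the item at the right end, i.e. the oldest entry.
        self._lists = defaultdict(lambda: deque(maxlen=max_size))

    def add(self, key, node_id):
        """Record node_id as the most recent entry for key."""
        entries = self._lists[key]
        try:
            # Assumption: a node appears once per list, so an older
            # occurrence is removed before it moves to the front.
            entries.remove(node_id)
        except ValueError:
            pass
        entries.appendleft(node_id)

    def page(self, key, page=0, page_size=10, newest_first=True):
        """Return one page of node ids, front to back or back to front."""
        entries = list(self._lists.get(key, ()))
        if not newest_first:
            entries.reverse()
        start = page * page_size
        return entries[start:start + page_size]
```

A short usage example under the same assumptions:

```python
idx = RecentIndex(max_size=3)
for node_id in (1, 2, 3, 4):
    idx.add("tag:neo4j", node_id)

idx.page("tag:neo4j")                      # [4, 3, 2]  (node 1 dropped out)
idx.page("tag:neo4j", newest_first=False)  # [2, 3, 4]
```

Writes here touch only the external list, so no graph node is locked when a tag is applied, which is the contention the in-graph index would otherwise create on a hot tag.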