Re: [Neo4j] Creating and managing external index

2011-11-22 Thread Peter Neubauer
Avi,
We are in the process of releasing a nicer base framework for
transactional index creation, and an index provider for Redis.
Meanwhile, if you want, you could look into the Berkeley DB index that I
tried to cook together (no guarantees there),
https://github.com/peterneubauer/bdb-index and see if that is
something worth considering.

Cheers,

/peter neubauer

GTalk:      neubauer.peter
Skype       peter.neubauer
Phone       +46 704 106975
LinkedIn   http://www.linkedin.com/in/neubauer
Twitter      http://twitter.com/peterneubauer

http://www.neo4j.org              - NOSQL for the Enterprise.
http://startupbootcamp.org/    - Öresund - Innovation happens HERE.



___
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user


Re: [Neo4j] Creating and managing external index

2011-11-22 Thread Avi Shai
Peter,

A Redis index provider sounds great. Blue Redis (Redis Blueprints) has some
ideas in it, but I think they can be improved on and adapted to Neo4j.
Transactional index creation sounds like what I need.

I have also been thinking about cases like tagging a resource, where
several points of interest are involved: the event, the resource, the
event source, and the user. I would perhaps want a temporal index updated
for each of those, but making the user wait on the index updates is not
the behavior I want. Instead, my idea was to insert the core data (the tag
event) and then push the index work onto an index update queue. Items would
usually be taken off the queue almost immediately, but the user would not
be forced to wait until all the indexes were updated after creating or
updating a node.
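
A minimal sketch of that queue idea, assuming a single background worker
and an in-process queue. The class name IndexUpdateQueue and the
apply_update callable are hypothetical; apply_update stands in for whatever
writes to the external index (Redis, Mongo, or anything else).

```python
import queue
import threading


class IndexUpdateQueue:
    """Hypothetical helper: applies index updates on a background thread."""

    def __init__(self, apply_update):
        self._apply = apply_update      # assumed callable that writes to the external index
        self._pending = queue.Queue()
        self.failed = []                # updates whose apply_update call raised
        worker = threading.Thread(target=self._drain, daemon=True)
        worker.start()

    def submit(self, update):
        """Enqueue an update and return at once; the caller never waits on the index."""
        self._pending.put(update)

    def wait_until_idle(self):
        """Block until every submitted update has been processed (useful in tests)."""
        self._pending.join()

    def _drain(self):
        while True:
            update = self._pending.get()
            try:
                self._apply(update)
            except Exception:
                # Keep the update so a later retry pass can pick it up.
                self.failed.append(update)
            finally:
                self._pending.task_done()
```

The core write (the tag event) would be committed first and submit() called
afterwards, so a slow or unavailable index only delays the feed and not the
user's request. An in-process queue loses pending updates if the process
dies, so a durable queue may be the better choice in practice.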

I suppose I could also have some sort of retry system in case updating
any of the indexes involved fails. The tough part is the temporal indexes:
ensuring they stay sorted correctly without scanning them much, or at all.
Other kinds of indexes, such as B-trees, handle inserts in a way that is a
bit more friendly to this sort of system.
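
One possible shape for the retry, as a sketch only. The function name
apply_with_retry, the attempt limit, and the backoff delay are all
assumptions. It also assumes each update carries its own timestamp and that
the index orders entries by that timestamp, so a retried (late) update still
lands in the right position without scanning the index.

```python
import time


def apply_with_retry(apply_update, update, max_attempts=3, base_delay=0.0):
    """Try apply_update(update) up to max_attempts times.

    Returns True on success, False if every attempt raised.
    base_delay is doubled after each failed attempt (simple backoff).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            apply_update(update)
            return True
        except Exception:
            # A real system would catch only the index client's error types.
            if attempt < max_attempts:
                time.sleep(base_delay * 2 ** (attempt - 1))
    return False
```

This only stays safe if applying the same update twice leaves the index in
the same state, which holds when the entry is keyed by node and ordered by
its own timestamp.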


My worry with inserting into an index inside a normal transaction is not
only the wait, but also constantly locking the same nodes when an in-graph
index sits on a hot spot. An example would be a discussion that is
constantly being updated and indexed: I don't want replies to have to
contend for the same index node. That said, most of the time it is probably
a non-issue, since the index writes happen so fast anyway.

--
View this message in context: 
http://neo4j-community-discussions.438527.n3.nabble.com/Creating-and-managing-external-index-tp3523613p3527899.html
Sent from the Neo4j Community Discussions mailing list archive at Nabble.com.


[Neo4j] Creating and managing external index

2011-11-20 Thread Avi Shai
What is the best way to create an external index, but only for certain nodes?
What I really want is something like the in-graph data structures, except
stored in one or more other databases. In essence I am indexing only a
sub-graph or a straight list of nodes. I then want to use these indexes as
entry points in some cases rather than traversing.

I understand that there is already Lucene, but I have data that is better
suited to other indexes. I still want to use Lucene for full-text search, just
not for anything else. I am currently taking a stab at implementing the
Blueprints index interfaces (manual, automatic), but for another purpose. If
I am always updating these indexes, but only for certain vertex types, what
is the best integration point? In my data service classes/lower-level Neo4j
code, or in a server event handler that plugs into the transaction? What about
for all vertices? I think I understand how to write the index classes, but
not the best way of consuming them, or whether they work well for
lots of small, partial indexes.
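
Whichever integration point is used, the per-type routing could look roughly
like the sketch below. Everything in it is hypothetical: the IndexDispatcher
class, the "type" key on each node, and the after_commit entry point, which
would be called either from the data service layer or from a transaction
event handler.

```python
from collections import defaultdict


class IndexDispatcher:
    """Hypothetical router: sends changed nodes only to indexes registered for their type."""

    def __init__(self):
        self._handlers = defaultdict(list)   # node type -> list of index callables

    def register(self, node_type, handler):
        """Register an index update callable for one node type."""
        self._handlers[node_type].append(handler)

    def after_commit(self, changed_nodes):
        """Route each changed node to the indexes for its type; return the number of calls made."""
        calls = 0
        for node in changed_nodes:
            # .get() avoids creating empty entries for unindexed types.
            for handler in self._handlers.get(node["type"], ()):
                handler(node)
                calls += 1
        return calls
```

Node types with no registered handler cost one dictionary lookup and
nothing more, which is what keeps many small partial indexes cheap.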

For instance, I want to store data as temporal values, with the most recent
data first, for a group of nodes. I'm not building Twitter or a blog, but
either is a good enough analogy. If I post something with a given tag, I
want to index all the nodes that have been tagged with that tag (tag edge) in
temporal order, for example to create a "recently tagged" feed, or a "recently
seen users" feed containing the users who have recently used
that tag. I could store this data in Redis exactly how I want and have a
hot set in memory that can then be used either directly in some pages in my
app, or as an entry point into Neo4j for more complex queries. These indexes
will probably require lots of writes, and I also want to avoid locking related
nodes on any updates.
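
To make the "recently tagged" feed concrete, here is an in-memory stand-in
for such an index. It is a sketch, not a Redis client: the RecentIndex class
and its method names are hypothetical. In Redis the same shape would
probably map onto one sorted set per tag scored by timestamp (ZADD to
insert, ZREVRANGE to read newest first, ZREMRANGEBYRANK to trim).

```python
import bisect


class RecentIndex:
    """Hypothetical per-key feed of node ids, newest first, capped at max_size."""

    def __init__(self, max_size=1000):
        self.max_size = max_size
        self._entries = {}   # key -> list of (timestamp, node_id), ascending
        self._latest = {}    # key -> {node_id: timestamp currently stored}

    def add(self, key, node_id, timestamp):
        """Record that node_id was seen under key at timestamp."""
        entries = self._entries.setdefault(key, [])
        latest = self._latest.setdefault(key, {})
        old = latest.get(node_id)
        if old is not None:
            if old >= timestamp:
                return                        # stale or duplicate update: ignore
            entries.remove((old, node_id))    # move the node to its new position
        bisect.insort(entries, (timestamp, node_id))
        latest[node_id] = timestamp
        if len(entries) > self.max_size:
            _, dropped = entries.pop(0)       # the oldest entry falls off the end
            del latest[dropped]

    def recent(self, key, count=10):
        """Return up to count node ids for key, most recent first."""
        if count <= 0:
            return []
        entries = self._entries.get(key, [])
        return [node_id for _, node_id in reversed(entries[-count:])]
```

The returned node ids can be rendered directly or used as entry points for
a traversal in the graph. Because position depends only on the timestamp,
updates that arrive out of order still end up sorted.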

Part of the reason I'm doing this is that I currently have lots of super nodes in
my design. I've patched this somewhat by keeping counts in node properties and
adding proxy nodes as mini-partitions to reduce the number of relationships.
I've also looked at things like combining common nodes together as
junctions, but there are probably too many permutations for that to scale. Anyway, if
I use in-graph indexes, I have to update my indexes on every insertion or
update. I'm going to try out indexed relationships, and I think they will
help, but with respect, I don't think they will scale well or fit my use
cases, especially for indexes where data drops out because the size is fixed
(like a fixed-length list).
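
For readers unfamiliar with the proxy-node workaround, one way to pick a
mini-partition is to hash the neighbour's id into a fixed number of buckets.
This is only an illustration under that assumption; the post does not say
how the partitions are actually chosen, and the proxy_for name and the key
format are hypothetical.

```python
import zlib


def proxy_for(super_node_id, neighbour_id, partitions=16):
    """Return the key of the proxy node that should hold this relationship.

    Uses CRC32 so the choice is stable across processes and restarts.
    """
    bucket = zlib.crc32(str(neighbour_id).encode("utf-8")) % partitions
    return f"{super_node_id}:proxy:{bucket}"
```

Each proxy then carries roughly 1/partitions of the super node's
relationships, at the cost of one extra hop on every traversal.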

I feel that creating index structures in the graph is nice, but it will
severely balloon the graph. Moreover, I want to reserve resources on the
servers running Neo4j for graph traversals and other graph activities, and I
would rather use other clustered servers to store huge amounts of index data
in memory. One other idea is to use another Neo4j instance as an index for
the first, but I think the characteristics of what I am doing are better suited
in some cases to Redis (temporal lists) or Mongo (hierarchical metrics),
depending on the use case. Example: pulling down linear lists of time data by
page, sorted front to back or back to front.
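
The paging example could be sketched as below, assuming the index hands
back (timestamp, node_id) pairs. The function name page and its parameters
are hypothetical. A real store would page on the server side rather than
sorting the whole list in the application.

```python
def page(entries, page_number, page_size, newest_first=True):
    """Return the node ids on one page of a time-ordered list.

    entries is an iterable of (timestamp, node_id) pairs in any order;
    page_number is zero-based.
    """
    if page_number < 0 or page_size <= 0:
        raise ValueError("page_number must be >= 0 and page_size must be > 0")
    ordered = sorted(entries, key=lambda entry: entry[0], reverse=newest_first)
    start = page_number * page_size
    return [node_id for _, node_id in ordered[start:start + page_size]]
```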

I know that's a lot, but I wanted to at least give some detail beyond what
I've already read here in all the old posts I've dug through this week. Any
feedback? Thanks.


--
View this message in context: 
http://neo4j-community-discussions.438527.n3.nabble.com/Creating-and-managing-external-index-tp3523613p3523613.html
Sent from the Neo4j Community Discussions mailing list archive at Nabble.com.