+1 for optimizing only when necessary. Having said that, there will always be cases where sharding is necessary. Mostly a sensible domain-specific decision can be taken, but I was thinking about what Michael said about traversal statistics, and that is a great runtime idea, going beyond what I previously thought was the main criterion, namely network density.
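
To make the traversal-statistics idea a bit more concrete, here is a minimal sketch of what a coarse-grained collector might look like. It is written in Python purely for brevity and is not Neo4j code: the class `TraversalStats` and its methods `record`, `hot` and `snapshot` are hypothetical names, relationship ids are assumed to be plain hashable values, and "hot" is assumed to mean "traversed at least `min_count` times, optionally within the last `max_age` seconds".

```python
import time
from collections import defaultdict


class TraversalStats:
    """Hypothetical in-memory record of how often and how recently
    each relationship has been traversed (not a Neo4j API)."""

    def __init__(self, clock=time.time):
        self._clock = clock              # injectable so the sketch is testable
        self._count = defaultdict(int)   # rel_id -> number of traversals
        self._last_seen = {}             # rel_id -> timestamp of last traversal

    def record(self, rel_id):
        """Call once each time a traversal follows the relationship."""
        self._count[rel_id] += 1
        self._last_seen[rel_id] = self._clock()

    def hot(self, min_count=1, max_age=None):
        """Return relationship ids that are frequent and/or recent,
        most frequently traversed first."""
        now = self._clock()
        result = []
        for rel_id, count in self._count.items():
            if count < min_count:
                continue
            if max_age is not None and now - self._last_seen[rel_id] > max_age:
                continue
            result.append(rel_id)
        return sorted(result, key=lambda r: -self._count[r])

    def snapshot(self):
        """Plain-dict view that could be persisted and mined later."""
        return {
            rel_id: {"count": count, "last_seen": self._last_seen[rel_id]}
            for rel_id, count in self._count.items()
        }


stats = TraversalStats()
for rel in [1, 2, 1, 3, 1, 2]:
    stats.record(rel)
print(stats.hot(min_count=2))  # prints [1, 2]
```

The output of `snapshot()` is the part a sharding or clustering algorithm would consume offline; how and where it would be stored is left open here.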
I wondered if the caching engine actually does keep some transient statistics on traversals, and perhaps that could be persisted in some way, and mined later by algorithms interested in sharding the graph? Perhaps even just something coarse-grained, like which relationships are 'hot': traversed frequently, or recently, or both. It might also help with algorithms designed for large graph visualization, helping to identify the islands and the areas of interest, for clustering nodes into mega-nodes when drawing very large graphs?

On Tue, Feb 22, 2011 at 8:52 AM, Mattias Persson <[email protected]> wrote:

> 2011/2/22 J T <[email protected]>
>
> > Hmm, I hadn't considered the apache approach but it still kind of goes
> > against the grain - perhaps I just want too much or it's my innate
> > laziness ... hehe ;)
> >
> > It's not just about data size, it's more about not wanting to have to
> > re-engineer/re-factor as things grow - whether that growth is in
> > concurrent access or in data quantity.
>
> There are not that many cases (fewer than you might imagine) where you'd
> need to scale/shard out Neo4j to multiple machines just to handle the load
> put on it. It's great to think ahead and be aware of limitations, but
> there's a pretty high chance you just won't run into those. And if/when you
> do, Neo4j will probably have evolved to handle that load for you anyway,
> maybe even sharding :)
>
> > On Mon, Feb 21, 2011 at 11:47 PM, Michael Hunger <[email protected]> wrote:
> >
> > > Hi J.T.,
> > >
> > > of course you can have the cache sharding taken care of by the server
> > > side, e.g. use an apache proxy for client sticky routing, redirecting
> > > according to URL patterns etc. But that doesn't cover your "domain".
> > >
> > > The problem is that, unlike simple kv stores, where sharding on the
> > > key is pretty easy, sharding graphs is much more demanding.
> > > You would like to have traversal locality (so that you don't have to
> > > cross servers for a single traversal).
> > > That means something that keeps (and also updates) your subgraphs to
> > > be in just one server.
> > > And deciding which subgraphs should be put together is either a pure
> > > domain-driven thing or something that could be achieved by having lots
> > > of (long running?) clients (and their request URLs) and looking at
> > > their traversal / query statistics and optimizing the data held
> > > permanently (or even "mastered") on the specific node for a certain
> > > set of requests.
> > >
> > > It would also mean that the occasional cross-server traversal should
> > > result in local caches being updated for the remote data.
> > >
> > > Is the problem we're talking about just data size? You can already
> > > store pretty big graphs in a single neo4j node (esp. when you go for
> > > big machines).
> > >
> > > Michael
> > >
> > > On 22.02.2011, at 00:15, J T wrote:
> > >
> > > > I realise that there are different qualities that can come into play
> > > > with the labels 'scalability' & 'performance' and I can see how your
> > > > strategy would help with some of those qualities, but it relies on
> > > > custom logic in the client application to do the sharding and load
> > > > spreading and doesn't address scaling the underlying persistent
> > > > storage engine.
> > > >
> > > > One of the things that attracted me to Riak and Cassandra (for the
> > > > use cases I can apply them to) is that sharding, load balancing and
> > > > persistence scaling were available out-of-the-box and pretty much
> > > > invisible to the client application. The client app didn't have to
> > > > do anything special. I appreciate that perhaps, because they have
> > > > different semantics, it's an easier problem for them to solve.
> > > >
> > > > I had a read of this page you wrote the other day:
> > > >
> > > > http://jim.webber.name/2011/02/16/3b8f4b3d-c884-4fba-ae6b-7b75a191fa22.aspx
> > > >
> > > > It was your comment "it's hard to achieve in practice" that prompted
> > > > me to post my initial message yesterday to enquire further.
> > > >
> > > > I'm no specialist in the field, I just know what I want hehe :)
> > > >
> > > > The only player in the field I've been able to find that might have
> > > > more of the qualities I am interested in is InfiniteGraph; it's a
> > > > shame that it doesn't have a 'server' version like neo does for me
> > > > to do a proper comparison.
> > > >
> > > > I'll stick with neo for now, and see how the marketplace matures in
> > > > the coming months - I'm amazed at how much movement there has been
> > > > in the last year.
> > > >
> > > > On Mon, Feb 21, 2011 at 3:09 PM, Jim Webber <[email protected]> wrote:
> > > >
> > > >> Yup, you nailed it better than I did Rick.
> > > >>
> > > >> Though your partition strategy might not be just "per user." For
> > > >> example in the geo domain, it makes sense to route requests for
> > > >> particular cities to specific nodes. It'll depend on your
> > > >> application how you generate your routing rules.
> > > >>
> > > >> Jim
> > > >>
> > > >> On 21 Feb 2011, at 14:51, Michael Hunger wrote:
> > > >>
> > > >>> You shouldn't be confused because you got it right :)
> > > >>>
> > > >>> Cheers
> > > >>>
> > > >>> Michael
> > > >>>
> > > >>> On 21.02.2011, at 15:40, Rick Otten wrote:
> > > >>>
> > > >>>> Ok, I'm following this discussion, and now I'm confused.
> > > >>>>
> > > >>>> My understanding was that the (potentially very large) database
> > > >>>> is replicated across all instances.
> > > >>>>
> > > >>>> If someone needed to traverse to something that wasn't cached,
> > > >>>> they'd take a performance hit, but still be able to get to it.
> > > >>>>
> > > >>>> I had understood the idea behind the load balancing is to
> > > >>>> minimize traversals out of cache by grouping similar sets of
> > > >>>> users on a particular server. (That way you don't need a ton of
> > > >>>> RAM to stash everything in the database, just the most
> > > >>>> frequently accessed nodes and relationships associated with a
> > > >>>> subset of the users.)
> > > >>>>
> > > >>>>> Hello JT,
> > > >>>>>
> > > >>>>>> One thing, when you say route requests to specific instances ..
> > > >>>>>> does that imply that node relationships can't span instances ?
> > > >>>>>
> > > >>>>> Yes that's right. What I'm suggesting here is that each instance
> > > >>>>> is a full replica that works on a subset of requests which are
> > > >>>>> likely to keep the caches warm.
> > > >>>>>
> > > >>>>> So if you can split your requests (e.g. all customers beginning
> > > >>>>> with "A" go to instance "1" ... all customers beginning with "Z"
> > > >>>>> go to instance "26"), they will benefit from having warm caches
> > > >>>>> for reading, while the HA infrastructure deals with updates
> > > >>>>> across instances transactionally.
> > > >>>>>
> > > >>>>> Jim
> > > >>>>
> > > >>>> --
> > > >>>> Rick Otten
> > > >>>> [email protected]
> > > >>>> O=='=+
>
> --
> Mattias Persson, [[email protected]]
> Hacker, Neo Technology
> www.neotechnology.com

_______________________________________________
Neo4j mailing list
[email protected]
https://lists.neo4j.org/mailman/listinfo/user

