Re: [Neo4j] Question from Webinar - traversing a path with nodes of different types
Hi Vipul,

Zooming out a little bit, what are the inputs to your algorithm, and what do you want it to do? For example, given 1 and 6, do you want to find any points in the chain between them that are join points of two (or more) subchains (5 in this case)?

David

On Wed, Apr 20, 2011 at 10:56 PM, Vipul Gupta vipulgupta...@gmail.com wrote:
My mistake - I meant 5 depends on both 3 and 8 and acts as a blocking point till 3 and 8 finish.

On Thu, Apr 21, 2011 at 11:19 AM, Vipul Gupta vipulgupta...@gmail.com wrote:
David/Michael,

Let me modify the example a bit. What if my graph structure is like this:

domain.Client@1 - domain.Router@2 - domain.Router@3 - domain.Router@5 - domain.Server@6 - domain.Router@7 - domain.Router@8 -

Imagine a manufacturing line. 6 depends on both 3 and 8 and acts as a blocking point till 3 and 8 finish. Is there a way to get a cleaner traversal for this kind of relationship? I want to get a complete intermediate traversal from Client to Server. Thanks a lot for helping out on this.

Best Regards,
Vipul

On Thu, Apr 21, 2011 at 12:09 AM, David Montag david.mon...@neotechnology.com wrote:
Hi Vipul,

Thanks for listening! It's a very good question, and the short answer is: yes! I'm cc'ing our mailing list so that everyone can take part in the answer. Here's the long answer, illustrated by an example:

Let's assume you're modeling a network.
You'll have some domain classes that are all networked entities with peers:

@NodeEntity
public class NetworkEntity {
    @RelatedTo(type = "PEER", direction = Direction.BOTH, elementClass = NetworkEntity.class)
    private Set<NetworkEntity> peers;

    public void addPeer(NetworkEntity peer) {
        peers.add(peer);
    }
}

public class Server extends NetworkEntity {}
public class Router extends NetworkEntity {}
public class Client extends NetworkEntity {}

Then we can build a small network:

Client c = new Client().persist();
Router r1 = new Router().persist();
Router r21 = new Router().persist();
Router r22 = new Router().persist();
Router r3 = new Router().persist();
Server s = new Server().persist();

c.addPeer(r1);
r1.addPeer(r21);
r1.addPeer(r22);
r21.addPeer(r3);
r22.addPeer(r3);
r3.addPeer(s);

c.persist();

Note that after linking the entities, I only call persist() on the client. You can read more about this in the reference documentation, but essentially it will cascade in the direction of the relationships created, and will in this case cascade all the way to the server entity. You can now query this:

Iterable<EntityPath<Client, Server>> paths = c.findAllPathsByTraversal(Traversal.description());

The above code will get you an EntityPath per node visited during the traversal from c. The example does, however, not use a very interesting traversal description, but you can still print the results:

for (EntityPath<Client, Server> path : paths) {
    StringBuilder sb = new StringBuilder();
    Iterator<NetworkEntity> iter = path.<NetworkEntity>nodeEntities().iterator();
    while (iter.hasNext()) {
        sb.append(iter.next());
        if (iter.hasNext()) sb.append(" - ");
    }
    System.out.println(sb);
}

This will print each path, with all entities in the path.
This is what it looks like:

domain.Client@1
domain.Client@1 - domain.Router@2
domain.Client@1 - domain.Router@2 - domain.Router@3
domain.Client@1 - domain.Router@2 - domain.Router@3 - domain.Router@5
domain.Client@1 - domain.Router@2 - domain.Router@3 - domain.Router@5 - domain.Server@6
domain.Client@1 - domain.Router@2 - domain.Router@3 - domain.Router@5 - domain.Router@4

Let us know if this is what you looked for. If you want to only find paths that end with a server, you'd use this query instead:

Iterable<EntityPath<Client, Server>> paths = c.findAllPathsByTraversal(Traversal.description().evaluator(new Evaluator() {
    @Override
    public Evaluation evaluate(Path path) {
        if (new ConvertingEntityPath(graphDatabaseContext, path).endEntity() instanceof Server) {
            return Evaluation.INCLUDE_AND_PRUNE;
        }
        return Evaluation.EXCLUDE_AND_CONTINUE;
    }
}));

In the above code example, graphDatabaseContext is a bean of type GraphDatabaseContext created by Spring Data Graph. This syntax will dramatically improve in future releases. It will print:

domain.Client@1 - domain.Router@2 - domain.Router@3 - domain.Router@5 - domain.Server@6

Regarding your second question about types: If you want to convert a node into an entity, you would use the TypeRepresentationStrategy configured internally in Spring Data Graph. See the reference documentation for more information on this. If you want to convert Neo4j paths to entity paths, you can use the ConvertingEntityPath class as seen above. As an implementation detail, the class name is stored on the node as a property. Hope this helped!
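The Evaluator contract used above (include/exclude crossed with continue/prune) can be illustrated without any Neo4j dependencies. The sketch below re-implements that decision table over a plain adjacency map — all names and the toy graph are invented for illustration, not Spring Data Graph API — so you can see why only paths ending at a Server come back, and why included paths are pruned rather than expanded further.

```java
import java.util.*;

// Toy re-implementation of the Evaluator decision table over a plain
// adjacency map. Illustrative only; not the Neo4j API.
public class EvaluatorSketch {
    enum Evaluation { INCLUDE_AND_PRUNE, EXCLUDE_AND_CONTINUE }

    static Map<String, List<String>> peers = Map.of(
        "Client@1", List.of("Router@2"),
        "Router@2", List.of("Router@21", "Router@22"),
        "Router@21", List.of("Router@3"),
        "Router@22", List.of("Router@3"),
        "Router@3", List.of("Server@6"),
        "Server@6", List.of());

    // Include a path (and stop expanding it) only when it ends at a Server.
    static Evaluation evaluate(List<String> path) {
        String end = path.get(path.size() - 1);
        return end.startsWith("Server")
            ? Evaluation.INCLUDE_AND_PRUNE
            : Evaluation.EXCLUDE_AND_CONTINUE;
    }

    public static List<List<String>> traverse(String start) {
        List<List<String>> results = new ArrayList<>();
        Deque<List<String>> stack = new ArrayDeque<>();
        stack.push(List.of(start));
        while (!stack.isEmpty()) {
            List<String> path = stack.pop();
            if (evaluate(path) == Evaluation.INCLUDE_AND_PRUNE) {
                results.add(path); // included, and not expanded further
                continue;
            }
            for (String next : peers.get(path.get(path.size() - 1))) {
                List<String> longer = new ArrayList<>(path);
                longer.add(next);
                stack.push(longer);
            }
        }
        return results;
    }

    public static void main(String[] args) {
        // Prints the two Client -> ... -> Server paths (one per branch).
        traverse("Client@1").forEach(System.out::println);
    }
}
```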
Re: [Neo4j] Question from Webinar - traversing a path with nodes of different types
Hi David,

Inputs are 1 and 6, and the graph is acyclic.

domain.Client@1 - domain.Router@2 - domain.Router@3 - domain.Router@5 - domain.Server@6 - domain.Router@7 - domain.Router@8 -

I want a way to start from 1, process the 2 path till it reaches 5 (say in a thread), process the 7 path till it reaches 5 (in another thread), then process 5 and eventually 6. The above step of processing an intermediate path and waiting on the blocking point can happen over and over again in a more complex graph (that is, there could even be a number of loops in between), and the traversal stops only when we reach 6. I hope this makes it a bit clearer. I was working out something for this, but it is turning out to be too complex a solution for this sort of graph traversal, so I am hoping you can suggest something.

Best Regards,
Vipul

On Thu, Apr 21, 2011 at 11:36 AM, David Montag david.mon...@neotechnology.com wrote:
Hi Vipul, Zooming out a little bit, what are the inputs to your algorithm, and what do you want it to do? For example, given 1 and 6, do you want to find any points in the chain between them that are join points of two (or more) subchains (5 in this case)? [...]
Re: [Neo4j] WebCrawler-Data in Neo4j
Hi Marc,

2011/4/19 Marc Seeger m...@marc-seeger.de:
Hey, I'm currently thinking about how my current data (in mysql + solr) would fit into Neo4j. In one of my documents, there are the 3 types of data I have:

1. Properties that have high cardinality: e.g. the domain name (www.example.org, unique), the subdomain name (www.), the host-name (example)
2. A bunch of numbers (the website latency (1244ms), the amount of incoming links (e.g. 2321))
3. A number of 'tags' that have a relatively low cardinality (< 100). Things like the webserver (apache), the country (germany)

As for the model, I think it would be something like this:
- Every domain gets a node
- #1 would be modeled as a property on the domain node
- #2 would probably be put into a lucene index so I can sort on it later on
- #3 could be modeled using relations. E.g. a node that has two properties: type:webserver and name:apache. All of the domain-nodes can have a relation called "runs on" to the webserver node.

Does this make sense? I am used to Document DBs, relational DBs and Column Stores, but Graph DBs are still pretty new to me and I don't think I got the model 100% :)

Using this model, would I be able to filter subsets of the data such as "All Domains that run on apache and are in Germany and have more than 200 incoming links, sorted by the amount of links"?

Even every subdomain and tag could be a node:

(www) --SUBDOMAIN_OF--> (example.org) --RUNS_ON--> (apache)
                                      \--RUNS_IN--> (germany)

You could then start from the apache or germany node:

Node apache = ...
Node germany = ...
for ( Relationship runsIn : germany.getRelationships( RUNS_IN, INCOMING ) )
{
    Node domain = runsIn.getStartNode();
    if ( apache.equals( domain.getSingleRelationship( RUNS_ON, OUTGOING ).getEndNode() ) )
    {
        int incomingLinks = (Integer) domain.getProperty( "links" );
        if ( incomingLinks > 200 )
        {
            // This is a hit, store in a list
        }
    }
}
// sort the result list

Or the other way around (start from number of links, via a sorted lucene lookup).
Sorry for the quite verbose lucene query code:

Node apache = ...
Node germany = ...
// "more than 200 incoming links": lower bound 200 exclusive, no upper bound
Query rangeQuery = NumericRangeQuery.newIntRange( "links", 200, null, false, true );
QueryContext query = new QueryContext( rangeQuery ).sort( new Sort( new SortField( "links", SortField.LONG ) ) );
for ( Node domain : domainIndex.query( query ) )
{
    if ( apache.equals( domain.getSingleRelationship( RUNS_ON, OUTGOING ).getEndNode() ) &&
         germany.equals( domain.getSingleRelationship( RUNS_IN, OUTGOING ).getEndNode() ) )
    {
        // This is a hit
    }
}

If performance becomes a problem then I'd guess you'll have to index more fields (links, webserver, country) into the same index so that compound queries can be asked.

I played a bit around with the neography gem in Ruby and I could do stuff like:

germany_nginx = germany_node.shortest_path_to(websrv_nginx).depth(2).nodes

But I couldn't figure out how to expand this query. Looking forward to the feedback!

Marc

--
Pessimists, we're told, look at a glass containing 50% air and 50% water and see it as half empty. Optimists, in contrast, see it as half full. Engineers, of course, understand the glass is twice as big as it needs to be. (Bob Lewis)

___ Neo4j mailing list User@lists.neo4j.org https://lists.neo4j.org/mailman/listinfo/user

--
Mattias Persson, [matt...@neotechnology.com]
Hacker, Neo Technology
www.neotechnology.com
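For reference, the compound filter-and-sort Marc asked about ("apache, germany, more than 200 links, sorted by links") can also be written out in plain Java against an in-memory stand-in for the domain nodes. The Domain record and sample data below are invented; in Neo4j the webserver/country checks would follow the RUNS_ON/RUNS_IN relationships and the links filter would come from the index, as in the code above.

```java
import java.util.*;
import java.util.stream.*;

// In-memory sketch of the compound query; Domain and the data are made up.
public class CompoundQuerySketch {
    record Domain(String name, String webserver, String country, int links) {}

    public static List<Domain> query(List<Domain> domains) {
        return domains.stream()
            .filter(d -> d.webserver().equals("apache"))   // RUNS_ON apache
            .filter(d -> d.country().equals("germany"))    // RUNS_IN germany
            .filter(d -> d.links() > 200)                  // more than 200 links
            .sorted(Comparator.comparingInt(Domain::links).reversed())
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Domain> data = List.of(
            new Domain("example.org", "apache", "germany", 2321),
            new Domain("foo.de", "nginx", "germany", 900),
            new Domain("bar.org", "apache", "germany", 150),
            new Domain("baz.com", "apache", "usa", 5000));
        query(data).forEach(d -> System.out.println(d.name()));
        // only example.org satisfies all three filters here
    }
}
```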
Re: [Neo4j] Error building Neo4j
We're successfully building it with Maven 2.

/anders

On 04/21/2011 04:15 AM, Kevin Moore wrote:
I've tried 1.3 tag, master, etc. Always the same error. Maven 3.0.2. Should I be using a different version?

[INFO] Unpacking /Users/kevin/source/github/neo4j/graph-algo/target/classes to /Users/kevin/source/github/neo4j/neo4j/target/sources with includes null and excludes:null
org.codehaus.plexus.archiver.ArchiverException: The source must not be a directory.
    at org.codehaus.plexus.archiver.AbstractUnArchiver.validate(AbstractUnArchiver.java:174)
    at org.codehaus.plexus.archiver.AbstractUnArchiver.extract(AbstractUnArchiver.java:107)
    at org.apache.maven.plugin.dependency.AbstractDependencyMojo.unpack(AbstractDependencyMojo.java:260)
    at org.apache.maven.plugin.dependency.UnpackDependenciesMojo.execute(UnpackDependenciesMojo.java:90)
    at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:107)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:209)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:84)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:59)
    at org.apache.maven.lifecycle.internal.LifecycleStarter.singleThreadedBuild(LifecycleStarter.java:183)
    at org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:161)
    at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:319)
    at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:156)
    at org.apache.maven.cli.MavenCli.execute(MavenCli.java:534)
    at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:196)
    at org.apache.maven.cli.MavenCli.main(MavenCli.java:141)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:290)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:230)
    at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:409)
    at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:352)
[INFO]
[INFO] Reactor Summary:
[INFO]
[INFO] Neo4j - Graph Database Kernel . SUCCESS [3:53.536s]
[INFO] Neo4j - JMX support ... SUCCESS [1.291s]
[INFO] Neo4j - Usage Data Collection . SUCCESS [13.238s]
[INFO] Neo4j - Lucene Index .. SUCCESS [5.020s]
[INFO] Neo4j - Graph Algorithms .. SUCCESS [0.204s]
[INFO] Neo4j . FAILURE [1:16.071s]
[INFO] Neo4j Community ... SKIPPED
[INFO] Neo4j - Generic shell . SKIPPED
[INFO] Neo4j Examples SKIPPED
[INFO] Neo4j Server API .. SKIPPED
[INFO] Neo4j Server .. SKIPPED
[INFO] Neo4j Server Examples . SKIPPED
[INFO] Neo4j Community Build . SKIPPED
[INFO]
[INFO] BUILD FAILURE
[INFO]
[INFO] Total time: 6:57.812s
[INFO] Finished at: Wed Apr 20 18:58:58 PDT 2011
[INFO] Final Memory: 17M/81M
[INFO]
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-dependency-plugin:2.1:unpack-dependencies (get-sources) on project neo4j: Error unpacking file: /Users/kevin/source/github/neo4j/graph-algo/target/classes to: /Users/kevin/source/github/neo4j/neo4j/target/sources
[ERROR] org.codehaus.plexus.archiver.ArchiverException: The source must not be a directory.
[ERROR] - [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn <goals> -rf :neo4j
Re: [Neo4j] Error building Neo4j
Hi Kevin,

I can replicate your problem. The way I worked around this was to use Maven 2.2.1 rather than Maven 3.0.x. Then I get a green build for the community edition. I'll poke the dev team and see what Maven versions they're running on.

Jim

___ Neo4j mailing list User@lists.neo4j.org https://lists.neo4j.org/mailman/listinfo/user
Re: [Neo4j] Question from Webinar - traversing a path with nodes of different types
Sounds like a simulation/operations research application. The graph database will be suitable for modeling the entities and their characteristics (transfer times = properties on relationships, setup and service times = properties on nodes, queue sizes, etc.), but I think you'll need a layer on top of the traversal framework for managing the overall simulation logic.

- Reply message -
From: Vipul Gupta vipulgupta...@gmail.com
Date: Thu, Apr 21, 2011 2:16 am
Subject: [Neo4j] Question from Webinar - traversing a path with nodes of different types
To: David Montag david.mon...@neotechnology.com
Cc: UserList user@lists.neo4j.org, michael.hun...@neotechnology.com

Hi David, Inputs are 1 and 6 + Graph is acyclic. [...]
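The "layer on top of the traversal framework" suggested here could, as one hypothetical sketch, be a plain dependency scheduler: a node becomes runnable only once all of its predecessors have finished, which gives exactly the blocking behaviour at a join node like 5 (it waits for both 3 and 8). All names below are invented and independent of the Neo4j API; the loop is sequential for clarity, but each ready node could equally be handed to a worker thread.

```java
import java.util.*;

// Kahn-style dependency scheduler: process a node only after everything it
// depends on has been processed. Sketch only; names are illustrative.
public class BlockingLineScheduler {
    public static List<String> run(Map<String, List<String>> dependsOn) {
        Map<String, Integer> pending = new HashMap<>();       // unfinished deps per node
        Map<String, List<String>> dependents = new HashMap<>();
        for (var e : dependsOn.entrySet()) {
            pending.put(e.getKey(), e.getValue().size());
            for (String dep : e.getValue())
                dependents.computeIfAbsent(dep, k -> new ArrayList<>()).add(e.getKey());
        }
        Deque<String> ready = new ArrayDeque<>();
        pending.forEach((n, c) -> { if (c == 0) ready.add(n); });
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String n = ready.poll();
            order.add(n); // "process" the node (could be dispatched to a thread)
            for (String d : dependents.getOrDefault(n, List.of()))
                if (pending.merge(d, -1, Integer::sum) == 0) ready.add(d);
        }
        return order;
    }

    public static void main(String[] args) {
        // 1 -> 2 -> 3 and 1 -> 7 -> 8 both feed the join node 5, then 6.
        Map<String, List<String>> deps = Map.of(
            "1", List.of(), "2", List.of("1"), "3", List.of("2"),
            "7", List.of("1"), "8", List.of("7"),
            "5", List.of("3", "8"), "6", List.of("5"));
        System.out.println(run(deps));
    }
}
```

The two branches can interleave in any order, but 5 is never emitted before both 3 and 8 are done, and 6 is always last.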
Re: [Neo4j] REST results pagination
Legacy application that just uses a new data source. It can be quite hard to get users away from their trusty old-chap UI. In the case of pagination, legacy might only mean some years, but still legacy :-).

@1-2) In the wake of mobile applications and mobile sites, a pagination system might be more relevant than bulk loading everything and displaying it. Defining smart filters might be problematic in such a use case as well. Parallelism of an application could also be an interesting aspect: each worker retrieves different pages of the graph, and the user does not have to care at all about separating the graph after downloading it. This would only be interesting, though, if the graph relations are not important.

Georg

On 21 April 2011 14:59, Rick Bullotta rick.bullo...@thingworx.com wrote:
Fwiw, I think paging is an outdated crutch, for a few reasons:

1) bandwidth and browser processing/parsing are largely non-issues, although they used to be
2) human users rarely have the patience (and usability sucks) to go beyond 2-4 pages of information. It is far better to allow incrementally refined filters and searches to get to a workable subset of data.
3) machine users could care less about paging
4) when doing visualization of a large dataset, you generally want the whole dataset, not a page of it, so that's another non use case

Discuss and debate please!

Rick

- Reply message -
From: Craig Taverner cr...@amanzi.com
Date: Thu, Apr 21, 2011 8:52 am
Subject: [Neo4j] REST results pagination
To: Neo4j user discussions user@lists.neo4j.org

I assume this:

Traverser x = Traversal.description().traverse( someNode );
x.nodes();
x.nodes(); // Not necessarily in the same order as previous call.

If that assumption is false or there is some workaround, then I agree that this is a valid approach, and a good efficient alternative when sorting is not relevant.
Glancing at the code in TraverserImpl though, it really looks like the call to .nodes will re-run the traversal, and I thought that would mean the two calls can yield results in different order?

OK. My assumptions were different. I assume that while the order is not easily predictable, it is reproducible as long as the underlying graph has not changed. If the graph changes, then the order can change also. But I think this is true of a relational database also, is it not? So, obviously pagination is expected (by me at least) to give page X as it is at the time of the request for page X, not at the time of the request for page 1. But my assumptions could be incorrect too...

I understand, and completely agree. My problem with the approach is that I think it's harder than it looks at first glance.

I guess I cannot argue that point. My original email said I did not know if this idea had been solved yet. Since some of the key people involved in this have not chipped into this discussion, either we are reasonably correct in our ideas, or so wrong that they don't know where to begin correcting us ;-)

This is what makes me push for the sorted approach - relational databases are doing this. I don't know how they do it, but they are, and we should be at least as good.

Absolutely. We should be as good. Relational databases manage to serve a page deep down the list quite fast. I must believe if they had to complete the traversal, sort the results and extract the page on every single page request, they could not be so fast. I think my ideas for the traversal are 'supposed' to be performance enhancements, and that is why I like them ;-) I agree the issue of what should be indexed to optimize sorting is a domain-specific problem, but I think that is how relational databases treat it as well. If you want sorting to be fast, you have to tell them to index the field you will be sorting on.

The only difference contra having the user put the sorting index in the graph is that relational databases will handle the indexing for you, saving you a *ton* of work, and I think we should too.

Yes. I was discussing automatic indexing with Mattias recently. I think (and hope I am right) that once we move to automatic indexes, it will be possible to put external indexes (à la Lucene) and graph indexes (like the ones I favour) behind the same API. In this case perhaps the database will more easily be able to make the right optimized decisions, and use the index for providing sorted results fast and with a low memory footprint where possible, based on the existence or non-existence of the necessary indices. Then all the developer needs to do to make things really fast is put in the right index. For some data, that would be Lucene and for others it would be a graph index. If we get to this point, I think we will have closed a key usability gap with relational databases. There are cases where you need to add this sort of meta data to your domain model,
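The "reproducible order" assumption under discussion can be sketched in a few lines: if re-running the traversal over an unchanged graph always yields the same order, then page X can be served by re-running and skipping, computed at the time of the request for page X. The traverse stand-in below is deliberately deterministic; all names are illustrative, not the Neo4j Traverser API.

```java
import java.util.*;
import java.util.stream.*;

// Offset/limit paging over a reproducible-order "traversal". Sketch only.
public class PaginationSketch {
    // Stand-in for re-running a traversal: same underlying data, same order.
    static Stream<String> traverse(List<String> graphData) {
        return graphData.stream().sorted(); // deterministic order
    }

    public static List<String> page(List<String> graphData, int page, int pageSize) {
        return traverse(graphData)           // re-run at request time
            .skip((long) page * pageSize)    // jump to the requested page
            .limit(pageSize)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> nodes = List.of("n4", "n2", "n5", "n1", "n3");
        System.out.println(page(nodes, 0, 2)); // [n1, n2]
        System.out.println(page(nodes, 1, 2)); // [n3, n4]
    }
}
```

Note this re-runs the whole traversal per request (the cost Craig describes); the sorted-index approach aims to avoid exactly that.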
Re: [Neo4j] REST results pagination
3) machine users could care less about paging

My thoughts are that parsing very large documents can perform poorly and requires the entire document to be slurped into (available) RAM. This puts a cap on the size of a usable result set and slows processing, or at least makes you pay an up-front cost, and decreases the potential for parallelism in other parts of your app.
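A minimal illustration of the slurp-versus-stream point (names and data invented): the slurping variant materializes the whole result set before acting on it, while the streaming variant touches one element at a time in constant memory.

```java
import java.util.*;
import java.util.stream.*;

// Slurping vs. streaming over a large "result set". Sketch only.
public class StreamingSketch {
    // Slurp: materialize everything up front, then count matches. O(n) memory.
    public static long slurpCount(int n) {
        List<Integer> all = IntStream.range(0, n).boxed().collect(Collectors.toList());
        return all.stream().filter(i -> i % 2 == 0).count();
    }

    // Stream: never holds more than one element at a time. O(1) memory.
    public static long streamCount(int n) {
        return IntStream.range(0, n).filter(i -> i % 2 == 0).count();
    }

    public static void main(String[] args) {
        System.out.println(slurpCount(1_000_000));  // 500000
        System.out.println(streamCount(1_000_000)); // 500000, without the up-front cost
    }
}
```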
Re: [Neo4j] REST results pagination
Fwiw, we use an idiot-resistant (no such thing as idiot-proof) approach that clamps the number of returned items on the server side by default. We allow the user to explicitly request to do something foolish and ask for more data, but it requires a conscious effort.

- Reply message -
From: Jacob Hansson ja...@voltvoodoo.com
Date: Thu, Apr 21, 2011 10:06 am
Subject: [Neo4j] REST results pagination
To: Neo4j user discussions user@lists.neo4j.org

On Thu, Apr 21, 2011 at 2:59 PM, Rick Bullotta rick.bullo...@thingworx.com wrote:
Fwiw, I think paging is an outdated crutch, for a few reasons:
1) bandwidth and browser processing/parsing are largely non issues, although they used to be

I disagree. They have improved significantly, for sure, but that is no reason to download massive amounts of data that will never be used.

2) human users rarely have the patience (and usability sucks) to go beyond 2-4 pages of information. It is far better to allow incrementally refined filters and searches to get to a workable subset of data.

I agree with the suckiness of paging and the awesomeness of filtering - but what do you do when the user's filter returns 40 million results? You somehow have to tell the user that damn, that filter, it returned forty freaking million results, you need to refine your search, buddy. The way the user expects that to happen is through presenting a paged, infinite-scrolled or similar interface, where she can see how many results were returned and act on that feedback.

3) machine users could care less about paging

Agreed, streaming is a much better way for machines to talk about data that doesn't fit in memory.

4) when doing visualization of a large dataset, you generally want the whole dataset, not a page of it, so that's another non use case

Not necessarily true. You need all the data that you want to visualize, but that is not necessarily all the data the user has asked for.
You can be clever about the visualization to keep it uncluttered, and paging-like behaviours may be a way to do that.

Discuss and debate please!

Rick

- Reply message -
From: Craig Taverner cr...@amanzi.com
Date: Thu, Apr 21, 2011 8:52 am
Subject: [Neo4j] REST results pagination
To: Neo4j user discussions user@lists.neo4j.org

[...]
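Rick's server-side clamp could look something like the following sketch (constants and names are made up): the server caps the result count by default and only honours a larger limit when the client explicitly opts in, and even then only up to a hard ceiling.

```java
// Server-side result clamping. Sketch with invented constants.
public class ResultClamp {
    static final int DEFAULT_MAX = 100;     // applied unless the client opts in
    static final int HARD_MAX = 10_000;     // never exceeded, opt-in or not

    public static int effectiveLimit(int requested, boolean explicitOptIn) {
        if (requested <= 0) return DEFAULT_MAX;           // unspecified -> default cap
        if (!explicitOptIn) return Math.min(requested, DEFAULT_MAX);
        return Math.min(requested, HARD_MAX);             // conscious effort, still bounded
    }

    public static void main(String[] args) {
        System.out.println(effectiveLimit(500, false)); // clamped to 100
        System.out.println(effectiveLimit(500, true));  // honoured: 500
    }
}
```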
Re: [Neo4j] REST results pagination
Good dialog, btw!

- Reply message -
From: Jacob Hansson ja...@voltvoodoo.com
Date: Thu, Apr 21, 2011 10:06 am
Subject: [Neo4j] REST results pagination
To: Neo4j user discussions user@lists.neo4j.org

On Thu, Apr 21, 2011 at 2:59 PM, Rick Bullotta rick.bullo...@thingworx.com wrote:

Fwiw, I think paging is an outdated crutch, for a few reasons:

1) Bandwidth and browser processing/parsing are largely non-issues, although they used to be.

I disagree. They have improved significantly, for sure, but that is no reason to download massive amounts of data that will never be used.

2) Human users rarely have the patience (and usability sucks) to go beyond 2-4 pages of information. It is far better to allow incrementally refined filters and searches to get to a workable subset of data.

I agree with the suckiness of paging and the awesomeness of filtering - but what do you do when the user's filter returns 40 million results? You somehow have to tell the user that, damn, that filter returned forty freaking million results, and the search needs refining. The way the user expects that to happen is through a paged, infinitely scrolled or similar interface, where she can see how many results were returned and act on that feedback.

3) Machine users couldn't care less about paging.

Agreed; streaming is a much better way for machines to talk about data that doesn't fit in memory.

4) When doing visualization of a large dataset, you generally want the whole dataset, not a page of it, so that's another non-use-case.

Not necessarily true. You need all the data that you want to visualize, but that is not necessarily all the data the user has asked for. You can be clever about the visualization to keep it uncluttered, and paging-like behaviours may be a way to do that.

Discuss and debate please!
Rick

- Reply message -
From: Craig Taverner cr...@amanzi.com
Date: Thu, Apr 21, 2011 8:52 am
Subject: [Neo4j] REST results pagination
To: Neo4j user discussions user@lists.neo4j.org

I assume this:

Traverser x = Traversal.description().traverse( someNode );
x.nodes();
x.nodes(); // Not necessarily in the same order as previous call.

If that assumption is false or there is some workaround, then I agree that this is a valid approach, and a good, efficient alternative when sorting is not relevant. Glancing at the code in TraverserImpl though, it really looks like the call to .nodes() will re-run the traversal, and I thought that would mean the two calls can yield results in different order?

OK. My assumptions were different. I assume that while the order is not easily predictable, it is reproducible as long as the underlying graph has not changed. If the graph changes, then the order can change also. But I think this is true of a relational database also, is it not? So, obviously, pagination is expected (by me at least) to give page X as it is at the time of the request for page X, not at the time of the request for page 1. But my assumptions could be incorrect too...

I understand, and completely agree. My problem with the approach is that I think it's harder than it looks at first glance.

I guess I cannot argue that point. My original email said I did not know if this idea had been solved yet. Since some of the key people involved in this have not chipped into this discussion, either we are reasonably correct in our ideas, or so wrong that they don't know where to begin correcting us ;-)

This is what makes me push for the sorted approach - relational databases are doing this. I don't know how they do it, but they are, and we should be at least as good.

Absolutely. We should be as good. Relational databases manage to serve a page deep down the list quite fast.
I must believe that if they had to complete the traversal, sort the results and extract the page on every single page request, they could not be so fast. I think my ideas for the traversal are 'supposed' to be performance enhancements, and that is why I like them ;-)

I agree the issue of what should be indexed to optimize sorting is a domain-specific problem, but I think that is how relational databases treat it as well. If you want sorting to be fast, you have to tell them to index the field you will be sorting on. The only difference compared to having the user put the sorting index in the graph is that relational databases will handle the indexing for you, saving you a *ton* of work, and I think we should too.

Yes. I was discussing automatic indexing with Mattias recently. I think (and hope I am right) that once we move to automatic indexes, it will be possible to put external indexes (à la Lucene) and graph indexes (like the ones I favour) behind the same API. In this case perhaps the database will more easily be able to make the right optimized decisions, and use the index for providing sorted results fast and
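The "sorting index in the graph" idea above can be sketched abstractly. This is a minimal illustration, not Neo4j API: a TreeMap stands in for any pre-maintained sorted index (in-graph or external), and serving page N becomes a cheap walk of already-sorted keys instead of a per-request sort of the full result set.

```java
import java.util.*;

public class SortedPageDemo {
    // Pre-maintained sorted index (stand-in for an in-graph or Lucene index).
    private final TreeMap<String, Long> byName = new TreeMap<>();

    public void index(long nodeId, String name) {
        byName.put(name, nodeId);
    }

    // Serve page `pageNo` (0-based) of size `pageSize` by walking the
    // already-sorted key set -- no per-request sort of the full result set.
    public List<Long> page(int pageNo, int pageSize) {
        List<Long> result = new ArrayList<>();
        int skip = pageNo * pageSize;
        for (Long id : byName.values()) {
            if (skip-- > 0) continue;  // step over earlier pages
            result.add(id);
            if (result.size() == pageSize) break;  // stop: later pages never touched
        }
        return result;
    }
}
```

The maintenance cost moves to write time (keeping the index sorted), which is exactly the trade-off relational databases make when you index the sort field.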
Re: [Neo4j] REST results pagination
This is indeed a good dialogue. The pagination versus streaming was something I'd previously had in my mind as orthogonal issues, but I like the direction this is going.

Let's break it down to fundamentals: as a remote client, I want to be just as rich and performant as a local client. Unfortunately, Deutsch, Amdahl and Einstein are against me on that, and I don't think I am tough enough to defeat those guys. So what are my choices? I know I have to be more granular to try to alleviate some of the network penalty, so doing operations bulkily sounds great. Now what I need to decide is whether I control the rate at which those bulk operations occur or whether the server does. If I want to control those operations, then paging seems sensible. Otherwise a streamed (chunked) encoding scheme would make sense if I'm happy for the server to throw results back at me at its own pace. Or indeed you can mix both so that pages are streamed. In either case, if I get bored of those results, I'll stop paging or I'll terminate the connection.

So what does this mean for implementation on the server? I guess this is important since it affects the likelihood of the Neo Tech team implementing it.

If the server supports pagination, it means we need a paging controller in memory per paginated result set being created. If we assume that we'll only go forward in pages, that's effectively just a wrapper around the traversal that's been uploaded. The overhead should be modest, and apart from the paging controller and the traverser, it doesn't need much state. We would need to add some logic to the representation code to support next links, but that seems a modest task.

If the server streams, we will need to decouple the representation generation from the existing representation logic, since that builds an in-memory representation which is then flushed. Instead we'll need a streaming representation implementation, which seems to be a reasonable amount of engineering.
We'll also need a new streaming binding to the REST server in JAX-RS land. I'm still a bit concerned about how rude it is for a client to just drop a streaming connection. I've asked Mark Nottingham for his authoritative opinion on that. But still, this does seem popular and feasible. Jim ___ Neo4j mailing list User@lists.neo4j.org https://lists.neo4j.org/mailman/listinfo/user
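Jim's forward-only paging controller could look roughly like this. It is only a sketch: a plain Iterator stands in for the uploaded traversal's result stream, and the names are hypothetical, not a proposed server API. The point is how little state it needs - just the live iterator and a page size.

```java
import java.util.*;

// A forward-only paging controller: a thin wrapper around the (lazy)
// result iterator, kept server-side per paginated result set.
public class PagingController<T> {
    private final Iterator<T> results;
    private final int pageSize;

    public PagingController(Iterator<T> results, int pageSize) {
        this.results = results;
        this.pageSize = pageSize;
    }

    // Pull at most one page's worth of results off the live iterator.
    public List<T> nextPage() {
        List<T> page = new ArrayList<>();
        while (page.size() < pageSize && results.hasNext()) {
            page.add(results.next());
        }
        return page;
    }

    // Whether the representation should render a "next" link.
    public boolean hasMore() {
        return results.hasNext();
    }
}
```

Because pages only go forward, nothing already served needs to be retained, which keeps the per-result-set memory overhead modest, as Jim suggests.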
Re: [Neo4j] REST results pagination
I can only think of a few use cases where losing some of the expected result is OK, for instance if you want to peek at the result.

IMHO, paging is, by definition, a peek. Since the client controls when the next page will be requested, it is not possible, or reasonable, to enforce that the complete set of pages (if ever requested) will represent a consistent result set. This is not supported by relational databases either. The result set, and the meaning of a page, can change between requests. So it can, and does, happen that some of the expected result is lost. This is completely different to the streaming result, which I see Jim commented on, and so I might just reply to his mail too :-)

I'm waiting for one of those SlapOnTheFingersExceptions that Tobias has been handing out :)

My fingers are, as yet, unscathed. The slap can come at any moment! :-)

This sounds really cool, would be a great thing to look into! Should you want examples, I have a wiki page on this topic at http://redmine.amanzi.org/wiki/geoptima/Geoptima_Event_Log
Re: [Neo4j] REST results pagination
I think Jim makes a great point about the differences between paging and streaming being client- or server-controlled. I think there is a related point to be made: paging does not, and cannot, guarantee a consistent total result set. Since the database can change between page requests, the pages can be inconsistent with one another. It is possible for the same record to appear in two pages, or for a record to be missed. This is certainly how relational databases work in this regard. But in the streaming case, we expect a complete and consistent result set - unless, of course, the client cuts off the stream.

The use cases are very different: paging is about getting a peek at the data, and rarely about paging all the way to the end, while streaming is about getting the entire result, but streamed for efficiency.

On Thu, Apr 21, 2011 at 5:00 PM, Jim Webber j...@neotechnology.com wrote:
[snip]
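Craig's point about inconsistent pages is easy to demonstrate: offset-based paging reads whatever the data looks like at request time, so a write that lands between two page requests can surface the same record twice (or, with a delete, hide one). A self-contained sketch, with a plain list standing in for the database:

```java
import java.util.*;

public class PagingSkew {
    // An offset-based page over the *current* state of the data.
    static List<String> page(List<String> data, int offset, int size) {
        int to = Math.min(offset + size, data.size());
        if (offset >= to) return new ArrayList<String>();
        return new ArrayList<String>(data.subList(offset, to));
    }

    public static void main(String[] args) {
        List<String> data = new ArrayList<String>(Arrays.asList("a", "b", "c", "d"));
        List<String> page1 = page(data, 0, 2);  // [a, b]
        data.add(0, "new");                     // a write sneaks in between page requests
        List<String> page2 = page(data, 2, 2);  // [b, c] -- "b" is served twice
        System.out.println(page1 + " " + page2);
    }
}
```

Had the interleaved write been a delete near the front instead, a record would have silently fallen between the pages - the mirror image of the duplication shown here.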
[Neo4j] New blog post on non-graph stores for graph-y things
Hi guys,

A while ago we were discussing using a non-graph native backend for graph operations. I've finally gotten around to writing up my thoughts on the thread here: http://jim.webber.name/2011/04/21/e2f48ace-7dba-4709-8600-f29da3491cb4.aspx

As always, I'd value your thoughts and feedback.

Jim
Re: [Neo4j] Question about REST interface concurrency
Hi Peter,

I'd be glad to share the code; I'll commit soon and share with the users list. I've run some more load/concurrency tests and am seeing some strange results. Maybe someone can help explain this to me:

I run a load test where I fire off 100K create-empty-node REST requests to Neo as quickly as possible. With my code updates to allow configuration of the Jetty thread pool size, I can effectively reduce or increase the maximum concurrent transaction limit on the server. If I limit the thread pool so that there is only 1 thread available for requests, I see (as expected) that the PeakNumberOfConcurrentTransactions reported by the Neo4j Transactions MBean is 1. If I scale the thread pool up so that there are 800 available request threads, I can throw enough load at the server to cause 800 concurrent transactions.

From what I have read, node creation causes a node-local lock, not a global node lock, so there shouldn't be a lock-imposed concurrency bottleneck. The strange thing is, no matter whether I have 1 or 800 concurrent transactions, my total node creation throughput is always the same (~1600 nodes/sec). Even with 800 concurrent transactions, my server is only using ~15% CPU and ~25% memory (JVM Xms/Xmx = 1024m/2048m), so server load wouldn't appear to be an issue. I've followed all the recommendations I could find, including sysctl limits and JVM settings, but the rate doesn't change. I have also tried running the load test from multiple clients simultaneously (just to be sure I'm not running into any limits on the client machine), and indeed as soon as I add a second load test client, the throughput on each client gets cut in half.

If I'm talking to Neo in a way that is unrestricted by things like thread pool size and concurrency limits, I'd expect to be able to scale up my load tests and see at least some level of throughput improvement until I start to saturate/overload the box.
The fact that increasing concurrency doesn't increase throughput makes me think that there's some internal bottleneck or synchronization point that's limiting. Any thoughts? I'm glad to look through the code and investigate; any ideas you have would be a big help. Thanks, and sorry for the long question!

Stephen

-----Original Message-----
From: Peter Neubauer [mailto:peter.neuba...@neotechnology.com]
Sent: Monday, April 18, 2011 12:50 AM
To: Neo4j user discussions
Subject: Re: [Neo4j] Question about REST interface concurrency

Stephen, did you fork the code? Would be good to merge in the changes or at least take a look at them!

Cheers,
/peter neubauer

GTalk: neubauer.peter
Skype: peter.neubauer
Phone: +46 704 106975
LinkedIn: http://www.linkedin.com/in/neubauer
Twitter: http://twitter.com/peterneubauer

http://www.neo4j.org - Your high performance graph database.
http://startupbootcamp.org/ - Öresund - Innovation happens HERE.
http://www.thoughtmade.com - Scandinavia's coolest Bring-a-Thing party.

On Mon, Apr 18, 2011 at 4:08 AM, Stephen Roos sr...@careerarcgroup.com wrote:

Hi Jim,

Thanks for the quick reply. I tried the configuration mentioned here (rest_max_jetty_threads): https://trac.neo4j.org/changeset/6157/laboratory/components/rest But it doesn't seem to have changed anything. I took a look through the code and didn't see any configuration settings exposed in Jetty6WebServer. I added the changes myself and am starting to see some good results (I've exposed settings for min/max threadpool size, # acceptor threads, acceptor queue size, and request buffer size). Is there anything else that you'd recommend tweaking to improve throughput? Thanks again for your help!

-----Original Message-----
From: Jim Webber [mailto:j...@neotechnology.com]
Sent: Friday, April 15, 2011 1:57 AM
To: Neo4j user discussions
Subject: Re: [Neo4j] Question about REST interface concurrency

Hi Stephen,

The same Jetty tweaks that worked in previous versions will work with 1.3. We haven't changed any of the Jetty stuff under the covers.

Jim
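One way to read Stephen's flat ~1600 nodes/sec: if some per-commit cost is serialized across all request threads (the transaction-log sync would be a natural suspect, though that is speculation here), Amdahl-style reasoning caps aggregate throughput at 1/cost no matter how many threads are in flight. The 0.625 ms figure below is purely hypothetical, chosen only because it matches the observed rate:

```java
public class SyncBound {
    // If every commit must pass through a serialized section of the given
    // duration, aggregate commits/sec cannot exceed 1 / that duration,
    // regardless of the number of concurrent request threads.
    static double maxTxPerSec(double serializedSecondsPerCommit) {
        return 1.0 / serializedSecondsPerCommit;
    }

    public static void main(String[] args) {
        // A hypothetical ~0.625 ms serialized portion per commit caps the
        // server near the observed ~1600 tx/s, with 1 thread or 800.
        System.out.println(maxTxPerSec(0.000625)); // 1600.0
    }
}
```

This would also explain why a second load-test client halves each client's rate: the clients share one serialized resource, so they split a fixed budget rather than adding capacity.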
Re: [Neo4j] Basic Node storage/retrieval related question?
Hi Karan,

Are you using Spring Data Graph, or the native Neo4j API?

David

On Thu, Apr 21, 2011 at 10:21 AM, G vlin...@gmail.com wrote:

I have a pojo with a field a, which I initialize like this:

Object a = 10;

I store the POJO containing this field using Neo4j. When I load this POJO, I have a getter method to get the object:

Object getA() { return a; }

*What should be the class type of a?* I am of the opinion it should be java.lang.Integer, but it is coming out to be java.lang.String. I am assuming this is because of node.getProperty(...). Is there a way I can get an Integer object only? Also, what types can be stored?

thanks, Karan

--
David Montag david.mon...@neotechnology.com
Neo Technology, www.neotechnology.com
Cell: 650.556.4411
Skype: ddmontag
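For what it's worth, plain Java autoboxes `Object a = 10` to a java.lang.Integer, so if the value comes back as a java.lang.String, the conversion happened in the storage/mapping layer, not in the language. A small sketch of the distinction; the String round-trip here is a hypothetical stand-in for whatever the mapping layer did, not Neo4j code:

```java
public class PropertyTypeDemo {
    public static void main(String[] args) {
        Object a = 10;  // autoboxed: runtime class is java.lang.Integer
        System.out.println(a.getClass().getName());  // java.lang.Integer

        // If a layer in between stringified the value (hypothetical round-trip),
        // an explicit parse is needed to recover the number on the way out:
        Object loaded = String.valueOf(a);           // now a java.lang.String
        int recovered = Integer.parseInt((String) loaded);
        System.out.println(recovered);               // 10
    }
}
```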
Re: [Neo4j] REST results pagination
Jim, we should schedule a group chat on this topic.

- Reply message -
From: Jim Webber j...@neotechnology.com
Date: Thu, Apr 21, 2011 11:01 am
Subject: [Neo4j] REST results pagination
To: Neo4j user discussions user@lists.neo4j.org
[snip]
Re: [Neo4j] Strange performance difference on different machines
On 2011-04-20, at 7:30 AM, Tobias Ivarsson wrote:

Sorry, I got a bit distracted when writing this. I should have added that I then want you to send the results of running that benchmark to me so that I can further analyze what the cause of these slow writes might be. Thank you, Tobias

That's what I figured you meant. Sorry for the delay, here they are. On a HP z400, quad Xeon W3550 @ 3.07GHz, ext4 filesystem:

dd if=/dev/urandom of=store bs=1M count=1000
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 111.175 s, 9.4 MB/s

dd if=store of=/dev/null bs=100M
10+0 records in
10+0 records out
1048576000 bytes (1.0 GB) copied, 0.281153 s, 3.7 GB/s

dd if=store of=/dev/null bs=100M
10+0 records in
10+0 records out
1048576000 bytes (1.0 GB) copied, 0.244339 s, 4.3 GB/s

dd if=store of=/dev/null bs=100M
10+0 records in
10+0 records out
1048576000 bytes (1.0 GB) copied, 0.242583 s, 4.3 GB/s

./run ../store logfile 33 100 500 100
tx_count[100] records[31397] fdatasyncs[100] read[0.9881029 MB] wrote[1.9762058 MB]
Time was: 5.012
19.952114 tx/s, 6264.365 records/s, 19.952114 fdatasyncs/s, 201.87897 kB/s on reads, 403.75793 kB/s on writes

./run ../store logfile 33 1000 5000 10
tx_count[10] records[30997] fdatasyncs[10] read[0.9755144 MB] wrote[1.9510288 MB]
Time was: 0.604
16.556292 tx/s, 51319.54 records/s, 16.556292 fdatasyncs/s, 1653.8523 kB/s on reads, 3307.7046 kB/s on writes

./run ../store logfile 33 1000 5000 100
tx_count[100] records[298245] fdatasyncs[100] read[9.386144 MB] wrote[18.772287 MB]
Time was: 199.116
0.5022198 tx/s, 1497.8455 records/s, 0.5022198 fdatasyncs/s, 48.270412 kB/s on reads, 96.540825 kB/s on writes

vmstat during the run:

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd    free   buff   cache   si   so    bi    bo    in    cs us sy id wa
 1  2      0 8541712 336716 3670940    0    0     1     7    12    20  4  1 95  0
 0  2      0 8525712 336716 3670948    0    0     0   979  1653  3186  4  1 60 35
 1  2      0 8525220 336716 3671204    0    0     0  1244  1671  3150  4  1 71 24
 0  2      0 8524724 336716 3671332    0    0     0   709  1517  3302  4  1 65 30
 0  2      0 8524476 336716 3671460    0    0     0  1033  1680 69342  5  7 59 29
 0  2      0 8539168 336716 3671588    0    0     0  1375  1599  3272  3  1 70 25
 1  2      0 8538860 336716 3671716    0    0     0  1157  1594  3097  3  1 72 24
 0  1      0 8541340 336716 3671844    0    0     0  1151  1512  3182  3  2 70 25
 0  1      0 8524812 336716 3671972    0    0     0  1597  1641  3391  4  2 72 22
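A quick sanity check of the reported figures: each run's rates are just count/time, and tx/s always equals fdatasyncs/s, i.e. one fdatasync per transaction. That is why the third run (100 syncs spread over 199 s) comes out so slow despite moving modest amounts of data:

```java
public class BenchCheck {
    // Rate = count / elapsed time, as in the benchmark's own summary lines.
    static double perSec(long count, double seconds) {
        return count / seconds;
    }

    public static void main(String[] args) {
        // First run: 100 tx, 31397 records in 5.012 s
        System.out.println(perSec(100, 5.012));      // ~19.95 tx/s (= fdatasyncs/s)
        System.out.println(perSec(31397, 5.012));    // ~6264.4 records/s
        // Third run: 100 tx, 298245 records in 199.116 s
        System.out.println(perSec(298245, 199.116)); // ~1497.8 records/s
    }
}
```

The arithmetic matches the printed summaries, so the anomaly isn't in the reporting; the write path itself is spending most of the third run waiting on syncs.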
Re: [Neo4j] Strange performance difference on different machines
Bob,

I don't know if you have already answered these questions. Which JDK (and version) are you using, and what are the JVM memory settings? Do you have a profiler handy that you could throw at your benchmark? (E.g. YourKit has a 30-day trial; other profilers surely do too.) Do you have the source code of your tests at hand, so we could run exactly the same code on our own Linux systems for cross-checking? Which Linux distribution is it, and 64- or 32-bit? Do you also have a disk formatted with ext3 to cross-check? (Perhaps just a loopback device.) How much memory does the Linux box have available?

Thanks so much.

Michael

On 21.04.2011 at 21:53, Bob Hutchison wrote:
[snip]
Re: [Neo4j] REST results pagination
Really cool discussion so far. I would also prefer streaming over paging, as with that approach we can give both ends more of the control they need. The server doesn't have to keep state over a long time (and also implement timeouts and clearing of that state; keeping that state for lots of clients also adds up). The client can decide how much of the result he's interested in, whether it is just 1 entry or 100k, and then just drop the connection. Streaming calls can also have a request timeout, so keeping them open for too long (with no activity) will close them automatically.

The server doesn't use up lots of memory for streaming; one could even leverage the laziness of traversers (and indexes) to avoid executing/fetching results that are not going to be sent over the wire. This should accommodate every kind of client, from the mobile phone which only lists a few entries to the big machine that can eat a firehose of result data in milliseconds. For this kind of look-ahead support we could (and should) add a possible offset, so that a client can request data (whose order _he_ is sure hasn't changed) by having the server skip the first n entries (so they don't have to be serialized/put on the wire).

I also think that this streaming API could already address many of the pain points of the current REST API. Perhaps we even want to provide a streaming interface in both directions, having the client able to, for instance, stream the creation of nodes and relationships and their indexing without restarting a connection for each operation. Whatever comes in this stream could also be processed in one TX (or, with TX tokens embedded in the stream, the client could even control that).

The only question posing itself here for me is whether we want to put this on top of the existing REST API or rather create a more concise API/format for it (with the later option of the format even degrading to binary for high-bandwidth interaction). I'd prefer the latter.
Cheers,
Michael

On 21.04.2011 at 21:09, Rick Bullotta wrote:
[snip]
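Michael's "skip the first n entries, stop when the client has enough" idea can be sketched over any lazy iterator. Nothing past the requested window is ever pulled from the underlying source, which is exactly the laziness he wants to exploit; the names here are hypothetical, not a proposed API:

```java
import java.util.*;

public class LazyWindow {
    // Lazily step over `skip` elements, then expose at most `limit` more.
    // Elements beyond the window are never computed/fetched from `source`.
    static <T> Iterator<T> window(final Iterator<T> source, int skip, final int limit) {
        while (skip-- > 0 && source.hasNext()) source.next();  // serialized-free skip
        return new Iterator<T>() {
            int remaining = limit;
            public boolean hasNext() { return remaining > 0 && source.hasNext(); }
            public T next() {
                if (!hasNext()) throw new NoSuchElementException();
                remaining--;
                return source.next();
            }
            public void remove() { throw new UnsupportedOperationException(); }
        };
    }
}
```

If `source` were a lazy traverser, dropping the connection after the window simply means the remaining results are never evaluated, so the server holds no long-lived per-client state.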
Re: [Neo4j] REST results pagination
Half-baked thoughts from a neo4j newbie hacker type on this topic: 1) I think it is very important, even with modern infrastructures, for the client to be able to optionally throttle the result set it generates with a query as it sees fit, and not just because of client memory and bandwidth limitations. With regular old SQL databases if you send a careless large query, you can chew up significant system resources, for significant amounts of time while it is being processed. At a minimum, a rowcount/pagination option allows you to build something into your client which can minimize accidental denial of service queries. I'm not sure if it is possible to construct a query against a large Neo4j database that would temporarily cripple it, but it wouldn't surprise me if you could. 2) Sometimes with regular old SQL databases I'll run a sanity check count() function with the query to just return the size of the expected result set before I try to pull it back into my data structure. Many times count() is all I needed anyhow. Does Neo4j have a result set size function? Perhaps a client that really could only handle small result sets could run a count(), and then filter the search somehow, if necessary, until the count() was smaller? (I guess it would depend on the problem domain...) In other words it may be possible, when it is really important, to implement pagination logic on the client side, if you don't mind running multiple queries for each set of data you get back. 3) If the result set was broken into pages, you could organize the pages in the server with a set of [temporary] graph nodes with relationships to the results in the database -- one node for each page, and a parent node for the result set. If order of the pages is important, you could add directed relationships between the page nodes. 
If the order within the pages is important you could either apply a sequence numbering to the page-result relationship, or add temporary directed result-set relationships too. Subsequent page retrievals would be new traversals based on the search result set graph. In a sense you would be building a temporary graph-index, I suppose. An advantage of organizing search result sets this way is that you could then union and intersect result sets (and do other set operations) without a huge memory overhead. (Which means you could probably store millions of search results at one time, and you could persist them through restarts.) 4) In some HA architectures you may have multiple database copies behind a load balancer. Would the search result pages be stored equally on all of them? Would the client require a sticky flag, to always go back to the same specific server instance for more pages? Depending on how fast writes get propagated across the cluster (compared to requests for the next page), if you were creating nodes as described in (3), would that work? 5) As for sorting: In my experience, if I need a result set sorted from a regular SQL database, I will usually sort it myself. Most databases I've ever worked with routinely have performance problems. You can minimize finger pointing and the risk of complicating those other performance problems by just directing the database to get me what I need; I'll do the rest of it back in the client. On the other hand, sometimes it is quicker and easier to let the database do the work. (Usually when I can only handle the data in small chunks on the client.) What I'm trying to say is that I think sorting is going to be more important to clients who want paginated results (i.e., using resource-limited clients) than to clients who are grabbing large chunks of data at a time (and will want to own any post-query processing steps anyhow).
-- Rick Otten rot...@windfish.net O=='=+
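Rick's point (3) — organizing a result set into linked page nodes — can be sketched in plain Java. Here ordinary objects stand in for the temporary graph nodes and the NEXT_PAGE/PAGE_RESULT relationships he describes; with Neo4j you would create real (ideally transient) nodes and relationships instead:

```java
// Sketch of idea (3): break a result set into page nodes linked in order.
// PageNode stands in for a temporary graph node; nextPage for a NEXT_PAGE
// relationship; the results list for PAGE_RESULT relationships.
import java.util.ArrayList;
import java.util.List;

public class PagedResultSketch {
    static class PageNode {
        final List<String> results = new ArrayList<>();
        PageNode nextPage;
    }

    static PageNode buildPages(List<String> resultSet, int pageSize) {
        PageNode first = new PageNode(), current = first;
        for (String r : resultSet) {
            if (current.results.size() == pageSize) {
                current.nextPage = new PageNode();  // link the next page
                current = current.nextPage;
            }
            current.results.add(r);
        }
        return first;  // the "result set" parent node would point here
    }

    public static void main(String[] args) {
        PageNode page = buildPages(List.of("a", "b", "c", "d", "e"), 2);
        int pages = 0;
        for (PageNode p = page; p != null; p = p.nextPage) pages++;
        System.out.println(pages);                 // number of pages
        System.out.println(page.nextPage.results); // second page
    }
}
```

A follow-up page request is then just a traversal from the result-set parent along NEXT_PAGE, which is what makes set operations over stored result sets cheap.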
Re: [Neo4j] REST results pagination
Rick, great thoughts. Good catch — I forgot to add the in-graph representation of the results to my mail, thanks for adding that part. Temporary (transient) nodes and relationships would really rock here, with the advantage that with HA you have them distributed to all cluster nodes. Certainly Craig will have some interesting things to add to this, as those probably resemble his in-graph indexes / R-trees. As traversers are lazy, a count operation is not so easily possible; you could run the traversal and discard the results. But then the client could also just pull those results until it reaches its internal thresholds and then decide to use more filtering, or stop the pulling and ask the user for more filtering (you can always retrieve n+1 and show the user that there are more than n results available). The index result size() method only returns an estimate of the result size (which might not contain currently changed index entries). Please don't forget that a count() query in an RDBMS can be as ridiculously expensive as the original query (especially if just the column selection was replaced with count, and sorting, grouping etc. were still left in place together with lots of joins). Sorting on your own instead of letting the db do it mostly harms performance, as it requires you to build up all the data in memory, sort it, and then use it. Instead, have the db do it more efficiently, stream the data, and use it directly from the stream. Cheers, Michael
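The n+1 trick Michael mentions — pull one more result than the page size to learn whether more exist, without counting the whole lazy result set — might look like this in plain Java (a sketch against an ordinary Iterator; a lazy Neo4j traverser would be consumed the same way):

```java
// Sketch of the n+1 probe: fill a page of at most pageSize items, then ask
// the lazy iterator for one more. If it has one, there are more results --
// no full count() needed, and no extra results are materialized.
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class PeekAheadPage {
    static <T> List<T> fetchPage(Iterator<T> lazyResults, int pageSize, boolean[] hasMore) {
        List<T> page = new ArrayList<>();
        while (lazyResults.hasNext() && page.size() < pageSize) {
            page.add(lazyResults.next());
        }
        hasMore[0] = lazyResults.hasNext();  // the "+1" probe
        return page;
    }

    public static void main(String[] args) {
        Iterator<Integer> results = List.of(1, 2, 3, 4, 5).iterator();
        boolean[] hasMore = new boolean[1];
        List<Integer> page = fetchPage(results, 3, hasMore);
        System.out.println(page);       // first page of 3
        System.out.println(hasMore[0]); // true: more results exist
    }
}
```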
[Neo4j] About Neo
Thanks, Javier, for your help; those articles are very useful.
[Neo4j] about two database
Hello list, I have a couple of questions: 1) I have two graph databases, and I need to get information from one database into the other without having to obtain the target database's instance from the other database. 2) I need to know how to open a graph database if it already exists. Thanks in advance.
Re: [Neo4j] Question about REST interface concurrency
I'm running on Linux (2.6.18). Watching network utilization, I never see rates higher than ~2.5 MBps on the server. I've also set net.core.rmem_min/max and net.ipv4.tcp_rmem/wmem in sysctl to be quite high, based on some recommendations I've found. Is this contrary to your own load tests? Are you able to hit the server with enough load that the system is maxed out? I was considering adding some instrumentation around transactions so that I can see the average internal transaction time span during a load test. If you have any other thoughts on what to look for/test, I'd be very appreciative. Thanks again, Stephen -Original Message- From: Jim Webber [mailto:j...@neotechnology.com] Sent: Thursday, April 21, 2011 12:24 PM To: Neo4j user discussions Subject: Re: [Neo4j] Question about REST interface concurrency Hi Stephen, Are you running on Linux (or Windows) by any chance? I wonder whether the asymptotic performance you're seeing is because you've gotten to a point where you're exercising the IO channel and file system. Jim
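The instrumentation Stephen proposes — tracking the average internal transaction time under load — could be sketched like this. This is a hypothetical helper, not an existing Neo4j API; the Supplier would wrap the real transactional work:

```java
// Sketch: time each unit of work and keep a running average, so a load test
// can show whether server-side transaction time grows as concurrency rises.
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

public class TxTimer {
    private final AtomicLong totalNanos = new AtomicLong();
    private final AtomicLong count = new AtomicLong();

    <T> T time(Supplier<T> work) {
        long start = System.nanoTime();
        try {
            return work.get();  // the real transactional work goes here
        } finally {
            totalNanos.addAndGet(System.nanoTime() - start);
            count.incrementAndGet();
        }
    }

    double averageMillis() {
        long n = count.get();
        return n == 0 ? 0.0 : totalNanos.get() / (double) n / 1_000_000.0;
    }

    public static void main(String[] args) {
        TxTimer timer = new TxTimer();
        for (int i = 0; i < 10; i++) {
            timer.time(() -> "simulated transaction");
        }
        System.out.println(timer.count.get());            // transactions timed
        System.out.println(timer.averageMillis() >= 0.0); // average is sane
    }
}
```

Atomic counters keep the helper safe under the concurrent load a REST stress test generates.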
Re: [Neo4j] Basic Node storage/retrieval related question?
Hello David, The problem is while developing using Spring Data Graph. Can you help me out with these issues? Best, Karan On Fri, Apr 22, 2011 at 12:19 AM, David Montag david.mon...@neotechnology.com wrote: Hi Karan, Are you using Spring Data Graph, or the native Neo4j API? David On Thu, Apr 21, 2011 at 10:21 AM, G vlin...@gmail.com wrote: I have a POJO with a field a, which I initialize like this: Object a = 10; I store the POJO containing this field using Neo4j. When I load this POJO, I have a getter method to get the object: Object getA() { return a; } *What should be the class type of a?* I am of the opinion it should be java.lang.Integer, but it is coming out to be java.lang.String. I am assuming this is because of node.getProperty(...). Is there a way I can get an Integer object only? Also, what types can be stored? Thanks, Karan -- David Montag david.mon...@neotechnology.com Neo Technology, www.neotechnology.com Cell: 650.556.4411 Skype: ddmontag
Re: [Neo4j] Basic Node storage/retrieval related question?
David, the issue is that I want to store different types of objects, retrieve them, and call different methods using reflection, where this value acts as a parameter. Unfortunately, I am storing an Integer and getting back an instance of type String. Is there something I need to do differently? Please let me know asap. Best, Karan
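Karan's symptom can be illustrated without Neo4j at all. A property store that keeps the concrete runtime type returns the Integer it was given; a mapping layer that serializes an Object-typed field via toString() is what would turn it into a String. This is a hedged sketch with a plain Map standing in for node.setProperty/getProperty, not Spring Data Graph's actual conversion logic:

```java
// Sketch: contrast storing a value with its concrete runtime type versus
// falling back to a String representation for an Object-typed field.
import java.util.HashMap;
import java.util.Map;

public class PropertyTypeDemo {
    // Simulated property store (stands in for node.setProperty/getProperty)
    static final Map<String, Object> props = new HashMap<>();

    static void setProperty(String key, Object value) { props.put(key, value); }
    static Object getProperty(String key) { return props.get(key); }

    public static void main(String[] args) {
        Object a = 10;  // declared Object, but the runtime type is Integer
        setProperty("a", a);
        Object back = getProperty("a");
        // Stored with its concrete type, the Integer survives the round trip:
        System.out.println(back.getClass().getSimpleName());
        // A layer that serializes unknown Object fields via toString()
        // would instead hand back "10" as a String:
        Object viaToString = String.valueOf(a);
        System.out.println(viaToString.getClass().getSimpleName());
    }
}
```

Declaring the field with its concrete type (Integer rather than Object) gives the mapping layer the information it needs to preserve the type.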