Re: [Neo4j] REST results pagination
Here at 49 responses, I'd like to reiterate Craig's earlier point that we are really talking about several separate issues, and I'm wondering if we should split this discussion up, because it is getting very hard to follow. As I see it, we are looking at three things:

* Paging. Use case: UI code that presents a paged, infinite-scrolled or similar interface to the user; peeking at results for debugging or other purposes.
* Streaming. Use case: returning huge data sets without killing anyone.
* Sorting. Use case: presenting lists of things to users; applications that care about the order of results for some other reason.

I think we've come to agree that streaming and paging serve similar but different purposes and are not quite able to replace each other's functionality. I also took the liberty of elevating sorting to its own topic, because I believe it should be a generic thing you can do on a result set, whereas paging and streaming are different means of returning a result set. If we want to continue this discussion, would anyone object to splitting it into these three parts?

/Jake

On Sun, Apr 24, 2011 at 2:18 AM, Rick Bullotta rick.bullo...@thingworx.com wrote:

> Let's discuss sometime soon. Creating resources that need to be cached or saved in session state bring with them a whole bunch of negative aspects...
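Jake's three topics are orthogonal concerns, which suggests they could surface as independent parameters on a single result-set resource. A minimal sketch of that separation; every parameter and path name here is hypothetical, not part of any actual Neo4j API:

```python
from urllib.parse import urlencode

# Hypothetical parameter names, for illustration only: sorting, paging and
# streaming compose independently on the same result-set URI.
def build_query_url(base, sort_key=None, page=None, page_size=None, stream=False):
    params = {}
    if sort_key:
        params["sort"] = sort_key          # sorting: generic, applies to any result set
    if page is not None:
        params["page"] = page              # paging: one means of returning the set
        params["page_size"] = page_size or 50
    if stream:
        params["stream"] = "true"          # streaming: the other means of returning it
    return base + ("?" + urlencode(params) if params else "")

url = build_query_url("/db/data/query/1234/results", sort_key="name", page=2, page_size=25)
```

The point of treating sorting as its own topic is visible here: it combines with either delivery mode without changing the shape of the resource.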
Re: [Neo4j] REST results pagination
In addition to what Jake just said about splitting this thread apart, I'd like to bring up what Rick suggested about getting together to thrash this out. Can you guys think about when we might want a Skype call for this? We have to take time zones into account to cover from CET through to PST (unless anyone is even further east?).

Jim

_______________________________________________
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user
Re: [Neo4j] REST results pagination
On Fri, 2011-04-22 at 17:43 +0100, Jim Webber wrote:

> I'm thinking chunked transfers. That is, the server starts sending a response and then eventually terminates it when the whole response has been sent to the client. Although it seems a bit rude, the client could simply opt to close the connection when it's read enough, providing what it has read makes sense.

Any intermediate proxies would have to cache the whole thing; many proxies are not designed for streaming responses, so they might read the whole thing before relaying it (although they seem to be getting a bit better at this with video over HTTP). So the server would probably end up generating the whole thing anyway if there was a proxy in the path. I think it's workable, but I'm not sure it is ideal...

Justin
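Jim's and Justin's chunked-transfer picture, reduced to code: the sketch below hand-rolls HTTP/1.1 chunked framing for a sequence of JSON results (a real server delegates this to its HTTP stack, so this is illustration only). The zero-length final chunk is what lets the server terminate the response whenever it is done, and a client may stop reading after any chunk:

```python
def chunked_body(results):
    """Frame each result as one HTTP/1.1 chunk: hex length, CRLF, payload, CRLF."""
    for item in results:
        payload = item.encode("utf-8")
        yield b"%x\r\n" % len(payload) + payload + b"\r\n"
    yield b"0\r\n\r\n"  # zero-length chunk: end of response

chunks = list(chunked_body(['{"id": 1}', '{"id": 2}']))
```

Each chunk is independently usable, which is exactly what makes the "client hangs up early" behaviour possible, rude or not.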
Re: [Neo4j] REST results pagination
Let's discuss sometime soon. Creating resources that need to be cached or saved in session state bring with them a whole bunch of negative aspects...

- Reply message -
From: Michael Hunger michael.hun...@neotechnology.com
Date: Fri, Apr 22, 2011 10:57 pm
Subject: [Neo4j] REST results pagination
To: Neo4j user discussions user@lists.neo4j.org

I spent some time looking at what others are doing, for inspiration. I kind of like the Riak/Basho approach with multipart chunks, and the approach of explicitly creating a resource for the query that can be navigated (either via pages or first/next/[prev/last] links) and expires (and could be reconstructed).

Cheers, Michael

Good discussion: http://stackoverflow.com/questions/924472/paging-in-a-rest-collection

* CouchDB: http://wiki.apache.org/couchdb/HTTP_Document_API (startKey + limit, endKey + limit, sorting, insert/update order)
* MongoDB: [cursor-id] + batch_size
* OrientDB: .../[limit]
* Sones: no real REST API, but a SQL on top of the graph: http://developers.sones.de/documentation/graph-query-language/select/ with limit and offset, but also depth (for the graph)
* HBase: explicitly creates scanners, which can then be accessed with next operations, and which expire after no activity for a certain timeout
* Riak: http://wiki.basho.com/REST-API.html; client-id header for client identification (sticky?); optional query parameters for including properties, and for whether to stream the data: keys=[true,false,stream]. If keys=stream, the response will be transferred using chunked encoding, where each chunk is a JSON object. The first chunk will contain the "props" entry (if props was not set to false). Subsequent chunks will contain individual JSON objects with the "keys" entry containing a sublist of the total keyset (some sublists may be empty).
Riak seems to support partial JSON (non-closed elements): -d '{"props":{"n_val":5'

It returns multiple responses in one go:

Content-Type: multipart/mixed; boundary=YinLMzyUR9feB17okMytgKsylvh

--YinLMzyUR9feB17okMytgKsylvh
Content-Type: application/x-www-form-urlencoded
Link: </riak/test>; rel="up"
Etag: 16vic4eU9ny46o4KPiDz1f
Last-Modified: Wed, 10 Mar 2010 18:01:06 GMT

{"bar":"baz"}

(this block can be repeated n times)

--YinLMzyUR9feB17okMytgKsylvh--
* Connection #0 to host 127.0.0.1 left intact
* Closing connection #0

Query results: Content-Type is always multipart/mixed, with a boundary specified.

Understanding the response body: the response body will always be multipart/mixed, with each chunk representing a single phase of the link-walking query. Each phase will also be encoded in multipart/mixed, with each chunk representing a single object that was found. If no objects were found, or "keep" was not set on the phase, no chunks will be present in that phase. Objects inside phase results will include Location headers that can be used to determine bucket and key. In fact, you can treat each object-chunk similarly to a complete response from a read-object request, without the status code.
HTTP/1.1 200 OK
Server: MochiWeb/1.1 WebMachine/1.6 (eat around the stinger)
Expires: Wed, 10 Mar 2010 20:24:49 GMT
Date: Wed, 10 Mar 2010 20:14:49 GMT
Content-Type: multipart/mixed; boundary=JZi8W8pB0Z3nO3odw11GUB4LQCN
Content-Length: 970

--JZi8W8pB0Z3nO3odw11GUB4LQCN
Content-Type: multipart/mixed; boundary=OjZ8Km9J5vbsmxtcn1p48J91cJP

--OjZ8Km9J5vbsmxtcn1p48J91cJP
Content-Type: application/json
Etag: 3pvmY35coyWPxh8mh4uBQC
Last-Modified: Wed, 10 Mar 2010 20:14:13 GMT

{"riak":"CAP"}
--OjZ8Km9J5vbsmxtcn1p48J91cJP--

--JZi8W8pB0Z3nO3odw11GUB4LQCN
Content-Type: multipart/mixed; boundary=RJKFlAs9PrdBNfd74HANycvbA8C

--RJKFlAs9PrdBNfd74HANycvbA8C
Location: /riak/test/doc2
Content-Type: application/json
Etag: 6dQBm9oYA1mxRSH0e96l5W
Last-Modified: Wed, 10 Mar 2010 18:11:41 GMT

{"foo":"bar"}
--RJKFlAs9PrdBNfd74HANycvbA8C--

--JZi8W8pB0Z3nO3odw11GUB4LQCN--
* Connection #0 to host 127.0.0.1 left intact
* Closing connection #0

Riak MapReduce, optional query parameters:
* chunked: when set to true, results will be returned one at a time in multipart/mixed format using chunked encoding.

Important headers:
* Content-Type: application/json when chunked is not true, otherwise multipart/mixed with application/json parts.

Other interesting endpoints: /ping, /stats
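The nested multipart/mixed layout quoted above is mechanical to consume; Python's stdlib email parser, for instance, handles the nesting directly. A sketch with a hand-built two-level body standing in for real Riak output (boundaries and payload here are invented for the example):

```python
from email.parser import BytesParser

# A miniature two-level multipart/mixed body in the same shape as Riak's
# link-walking responses: one phase part containing one JSON object part.
raw = (b"Content-Type: multipart/mixed; boundary=outer\r\n\r\n"
       b"--outer\r\n"
       b"Content-Type: multipart/mixed; boundary=inner\r\n\r\n"
       b"--inner\r\n"
       b"Content-Type: application/json\r\n\r\n"
       b'{"riak":"CAP"}\r\n'
       b"--inner--\r\n"
       b"--outer--\r\n")

msg = BytesParser().parsebytes(raw)
# walk() visits every nested part; keep only the JSON leaves.
bodies = [part.get_payload().strip() for part in msg.walk()
          if part.get_content_type() == "application/json"]
```

Each leaf part carries its own headers (Location, Etag, ...), which is what lets a client treat every object-chunk like a standalone read response.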
Re: [Neo4j] REST results pagination
Michael Hunger wrote:

> Good catch, forgot to add the in-graph representation of the results to my mail, thanks for adding that part. Temporary (transient) nodes and relationships would really rock here, with the advantage that with HA you have them distributed to all cluster nodes. Certainly Craig has some interesting things to add to this, as those probably resemble his in-graph indexes / R-Trees.

I certainly make use of this model, much more for my statistical analysis than for graph indexes (but I'm planning to merge indexes and statistics). However, in my case the structures are currently very domain-specific. But I think the idea is sound and should be generalizable.

What I do is have a concept of a 'dataset' on which queries can be performed. The dataset is usually the root of a large sub-graph. The query parser (domain-specific) creates a hashcode of the query, checks if the dataset node already has a resultset (as a connected sub-graph with its own root node containing the previous query hashcode), and if so returns that (traverses it); otherwise it performs the complete dataset traversal, creating the resultset as a new subgraph, and then returns it.

This works well specifically for statistical queries, where the resultset is much smaller than the dataset, so adding new subgraphs has small impact on the database size, and the resultset is much faster to return; it is a performance enhancement for repeated requests from the client. Also, I keep the resultset permanently, not temporarily. Very few operations modify the dataset, and if they do, we delete all resultsets and they get re-created the next time. My work on merging the indexes with the statistics is also planned to only recreate 'dirty' subsets of the result-set, so modifying the dataset has minimal impact on query performance.

After reading Rick's previous email I started thinking of approaches to generalizing this, but I think your 'transient' nodes perhaps encompass everything I thought about.
Here is an idea:

- Have new nodes/relations/properties tables on disk, like a second graph database, but different in the sense that it has one-way relations into the main database, which cannot be seen by the main graph and so are by definition not part of the graph. These can have transience and expiry characteristics. Then we can build the resultset graphs as transient graphs in the transient database, with 'drill-down' capabilities into the original graph (something I find I always need for statistical queries, and something a graph is simply much better at than a relational database).
- Use some kind of hashcode in the traversal definition or query to identify existing, cached, transient graphs in the second database, so you can rely on those for repeated queries, pagination, streaming, etc.

Michael Hunger wrote:

> As traversers are lazy, a count operation is not so easily possible; you could run the traversal and discard the results. But then the client could also just pull those results until it reaches its internal thresholds and then decide to use more filtering, or stop the pulling and ask the user for more filtering (you can always retrieve n+1 and show the user that there are more than n results available).

Yes. Count needs to perform the traversal. So the only way to not have to traverse twice is to keep a cache. If we make the cache a transient sub-graph (possibly in the second database I described above), then we have the interesting behaviour that count() takes a while, but subsequent queries, pagination or streaming, are fast.

Michael Hunger wrote:

> Please don't forget that a count() query in an RDBMS can be as ridiculously expensive as the original query (especially if just the column selection was replaced with count, and sorting, grouping etc. were still left in place together with lots of joins).
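A toy, in-memory version of the caching scheme Craig describes: key the materialized result set by a hash of the query, serve repeats from the cache, and invalidate everything when the dataset changes. A plain function stands in for the graph traversal, and all names are illustrative, not any real Neo4j API:

```python
import hashlib

class ResultSetCache:
    """Query-hash -> materialized result set, like Craig's resultset subgraphs."""
    def __init__(self):
        self._cache = {}

    def query(self, query_text, run_traversal):
        key = hashlib.sha1(query_text.encode("utf-8")).hexdigest()
        if key not in self._cache:
            # the expensive full-dataset traversal runs only on a cache miss
            self._cache[key] = list(run_traversal(query_text))
        return self._cache[key]

    def invalidate(self):
        # dataset modified: drop all result sets; they rebuild on next query
        self._cache.clear()

calls = []
def traverse(q):
    calls.append(q)           # record how often the "dataset" is traversed
    return [1, 2, 3]

cache = ResultSetCache()
first = cache.query("all stats for dataset 42", traverse)
second = cache.query("all stats for dataset 42", traverse)  # cache hit
```

This also captures the count() observation above: the first count pays for the full traversal, while subsequent counts are just the length of the cached list.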
Good to hear they have the same problem as us :-) (or even more problems)

Michael Hunger wrote:

> Sorting on your own instead of letting the db do it mostly harms performance, as it requires you to build up all the data in memory, sort it, and then use it. Instead, have the db do it more efficiently, stream the data, and you can use it directly from the stream.

Client-side sorting makes sense if you know the domain well enough to know, for example, that you will receive a small enough result set to 'fit' in the client, and you want to give the user multiple interactive sort options without hitting the database again. But I agree that in general it makes sense to get the database to do the sort.

Cheers, Craig
Re: [Neo4j] REST results pagination
Craig wrote:

> Client-side sorting makes sense if you know the domain well enough to know, for example, that you will receive a small enough result set to 'fit' in the client, and want to give the user multiple interactive sort options without hitting the database again. But I agree that in general it makes sense to get the database to do the sort.

I'll concede this point. In general it should be better to do sorts on the database server, which is typically by design a hefty backend system optimized for that sort of processing.

In my experience with regular SQL databases, unfortunately, they typically only scale vertically and are usually running on expensive enterprise-grade hardware. Most of the ones I've worked with either run on minimally sized hardware or have quickly outgrown their hardware. So they are always either:

1) Currently suffering from a capacity problem.
2) Just recovering from a capacity problem.
3) Heading rapidly towards a new capacity problem.

The next problem I run into is a political rather than a technical one. The database administration team is often a different group of people from the appserver/front-end development team. The guys writing the queries are usually closer to the appserver than to the database. In other words, it is easier for them to manage a problem in the appserver than it is to manage a problem in the database.

So, instead of having a deep well of data processing power to draw on, backed by a wide layer of thin commodity-hardware presentation-layer servers, we end up transferring data processing work out of the data server and into the presentation layer.

As we evolve into building data processing systems which can scale horizontally on commodity hardware, the perpetual capacity problems the legacy vertical databases suffer from may wane, finally freeing the other layers from having to pick up the slack.
--
Rick Otten
rot...@windfish.net
O=='=+
Re: [Neo4j] REST results pagination
On Thu, Apr 21, 2011 at 11:18 PM, Michael Hunger michael.hun...@neotechnology.com wrote:

> Rick, great thoughts. Good catch, forgot to add the in-graph representation of the results to my mail, thanks for adding that part. Temporary (transient) nodes and relationships would really rock here, with the advantage that with HA you have them distributed to all cluster nodes. Certainly Craig has some interesting things to add to this, as those probably resemble his in-graph indexes / R-Trees.
>
> As traversers are lazy, a count operation is not so easily possible; you could run the traversal and discard the results. But then the client could also just pull those results until it reaches its internal thresholds and then decide to use more filtering, or stop the pulling and ask the user for more filtering (you can always retrieve n+1 and show the user that there are more than n results available). The index result size() method only returns an estimate of the result size (which might not contain currently changed index entries).
>
> Please don't forget that a count() query in an RDBMS can be as ridiculously expensive as the original query (especially if just the column selection was replaced with count, and sorting, grouping etc. were still left in place together with lots of joins).
>
> Sorting on your own instead of letting the db do it mostly harms performance, as it requires you to build up all the data in memory, sort it, and then use it. Instead, have the db do it more efficiently, stream the data, and you can use it directly from the stream.

throw new SlapOnTheFingersException("sometimes the application developer can do a better job since she has better knowledge of the data; the database only has generic knowledge");

Since Jake had already mentioned (in this very thread) that he expected one of those, I thought I might as well throw one in there.
I agree with the analysis of count(): as the name implies, it will have to run the entire query in order to count the number of resulting items.

About sorting I'm torn. The perception that sorting in the database is slow, which Rick points to, is one that I've seen a lot. When you hand the responsibility of sorting to the database, you hide the fact that sorting is an expensive operation; it does require reading in all the data in order to sort it. People often expect databases to be smarter than that, since they sometimes are, but that is pretty much only when reading straight from an index and not doing much more. A generic sort of data can never be better than O(log(n!)) [which is almost equal to, and commonly rounded to, the easier-to-compute O(n log(n))].

If you put the responsibility of sorting in the hands of the application, you can sometimes utilize knowledge about the data to do more efficient sorting than the database could have done, most often by simply doing application-level filtering of the data before sorting it, based on some filtering that could not be transferred to the database query. This does, however, make the work of the application developer slightly more tedious, which is why I think it would be sensible to have support for sorting at the database level, and to hope that users will be sensible about using it, and not assume magic from it.

Something I find very interesting is the concept of semi-sorted data. Semi-sorted data is often good enough, easier to achieve, and quite easy to then sort completely if that is required. One example of semi-sorted data is data in an order that satisfies the heap property. Another, for spatial queries, is returning the closest hits first but not necessarily in perfect order: say, returning the hits within a one-mile radius first, before the ones in a radius between 1 and 10 miles, and so on, without requiring the hits in each 'segment' to be perfectly ordered by distance.
Breadth-first order is another example of semi-sorted data; it could be used when traversing data as you've outlined with paging nodes, or similarly grouped by parent node order.

I must say that I really enjoy following this discussion. I really like the idea of streaming, since I think it can be implemented more easily than paging, while satisfying many of the desired use cases. But I still want to hear more arguments for and against both alternatives. And as has already been pointed out, they aren't mutually exclusive. I'll keep listening in on the conversation, but I don't have much more to add at this point.

I have one request for the structure of the conversation, though: when you quote what someone else has said before you, could you please include who that person was? It makes going back and reading the full context easier.

Cheers,
--
Tobias Ivarsson tobias.ivars...@neotechnology.com
Hacker, Neo Technology
www.neotechnology.com
Cellphone: +46 706 534857
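Tobias's heap-property example in runnable form: building a binary heap is O(n) and leaves the data only semi-sorted, after which full order can be recovered incrementally for just the k items the client actually consumes (O(k log n)), instead of sorting everything up front:

```python
import heapq

data = [9, 4, 7, 1, 8, 2]
heapq.heapify(data)  # O(n); data now satisfies the heap property, not full order

# Heap property: every parent is <= each of its children.
assert all(data[i] <= data[c]
           for i in range(len(data))
           for c in (2 * i + 1, 2 * i + 2) if c < len(data))

# Deliver only the three "closest" hits, in order, on demand.
top3 = [heapq.heappop(data) for _ in range(3)]
```

This maps directly onto the paging and streaming use cases: a semi-sorted result set can start flowing immediately, and perfect order is paid for only on the prefix the client actually reads.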
Re: [Neo4j] REST results pagination
Hi Michael,

Just in case we're not talking about the same kind of streaming -- when I think streaming, I think streaming uploads, streaming downloads, etc. I'm thinking chunked transfers. That is, the server starts sending a response and then eventually terminates it when the whole response has been sent to the client. Although it seems a bit rude, the client could simply opt to close the connection when it's read enough, providing what it has read makes sense. Sometimes document fragments can make sense:

<results>
  <node id="1234">
    <property name="planet" value="Earth"/>
  </node>
  <node id="1235">
    <property name="planet" value="Mars"/>
  </node>
  <!-- client gets bored here and kills the connection, missing out on what would have followed -->
  <node id="1236">
    <property name="planet" value="Jupiter"/>
  </node>
  <node id="1237">
    <property name="planet" value="Saturn"/>
  </node>
</results>

In this case we certainly don't have well-formed XML, but some streaming API (e.g. StAX) might already have been able to create some local objects on the client side as the Earth and Mars nodes came in. I don't think this is elegant at all, but it might be practical. I've asked Mark Nottingham for his view on this since he's pretty sensible about Web things.

Jim
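Jim's scenario is easy to demonstrate with an incremental (StAX-style) parser. Python's xml.etree XMLPullParser plays that role here: the client materializes Earth and Mars as their elements complete, even though the closing </results> tag never arrives:

```python
import xml.etree.ElementTree as ET

parser = ET.XMLPullParser(events=("end",))
planets = []

# The stream arrives in pieces and is cut off mid-document.
chunks = [
    '<results>',
    '<node id="1234"><property name="planet" value="Earth"/></node>',
    '<node id="1235"><property name="planet" value="Mars"/></node>',
    # connection dies here; </results> is never received
]
for chunk in chunks:
    parser.feed(chunk)
    for _event, elem in parser.read_events():
        if elem.tag == "node":  # a complete <node> element has been parsed
            planets.append(elem.find("property").get("value"))
```

The document as a whole is never well-formed, yet every object the client did receive was usable the moment its element closed, which is exactly Jim's "inelegant but practical" point.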
Re: [Neo4j] REST results pagination
Hi Georg,

It would at least have to be an iterator over pages; otherwise the results tend to be fine-grained and so horribly inefficient to send over a network.

Jim

On 22 Apr 2011, at 18:24, Georg Summer wrote:

> I might be a little newbish here, but then why not an iterator? The iterator lives on the server and is accessible through the REST interface, providing an advance and a value method. It either operates on a stored, once-created, stable result set, or holds the query and evaluates it on demand (issues of the changing underlying graph included). The client can get paginator functionality by advancing and dereferencing the iterator n times, or streaming-like behaviour by constantly pushing the obtained data into a queue and keeping going. If the client does not need the iterator anymore, it simply stops using it, and a timeout eventually kills it on the server. A client-callable delete method for the iterator would work as well.
>
> Georg
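Jim's refinement of Georg's idea in miniature: the server-side cursor advances a whole page at a time, so each round trip carries a useful batch rather than a single item. The class and method names are invented for illustration, not part of any real API:

```python
import itertools

class PagedIterator:
    """Server-side cursor over a (possibly lazy) result stream."""
    def __init__(self, results, page_size):
        self._it = iter(results)
        self.page_size = page_size

    def next_page(self):
        """Return up to page_size items; an empty list means the stream is exhausted."""
        return list(itertools.islice(self._it, self.page_size))

cursor = PagedIterator(range(7), page_size=3)
pages = [cursor.next_page() for _ in range(4)]
```

Because the underlying iterator can stay lazy, the server never materializes the full result set; expiring the cursor on inactivity, as Georg suggests, would bound how long it is held.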
Re: [Neo4j] REST results pagination
SAX (or StAX) is an example of streaming with a higher-level format, but there are plenty of other ways as well. The *critical* performance element is to *never* have to accumulate an entire intermediate document on either side (e.g. a JSON object or XML DOM) if you can avoid it. You end up requiring 4x the resources (or more), extra latency, more parsing, and more garbage collection. I'll get with Jim Webber and propose a prototype of alternatives. Note also that the lack of binary I/O in the browser without Flash/Java/Silverlight is a challenge, but we can work around it.

- Reply message -
From: Michael DeHaan michael.deh...@gmail.com
Date: Fri, Apr 22, 2011 12:18 pm
Subject: [Neo4j] REST results pagination
To: Neo4j user discussions user@lists.neo4j.org

On Thu, Apr 21, 2011 at 5:00 PM, Michael Hunger michael.hun...@neotechnology.com wrote:

> Really cool discussion so far. I would also prefer streaming over paging, as with that approach we can give both ends more of the control they need.

Just in case we're not talking about the same kind of streaming -- when I think streaming, I think streaming uploads, streaming downloads, etc. If the REST format is JSON (or XML, whatever), that's a /document/, so you can't just say "read the next (up to) 512 bytes and work on it". It becomes a more low-level endeavor, because if you're in the middle of reading a record, or don't even have the end-of-list terminator, what you have isn't parseable yet. I'm sure a lot of hacking could be done to let the client figure out whether it has enough, short of seeing the closing array element, but it's a lot to ask of a JSON client.

So I'm interested in how, in that proposal, the REST API might stream results to a client, because for the streaming to be meaningful you need to be able to parse what you get back and know where the boundaries are (or build a buffer until you fill in a data structure enough to operate on it). I don't see that working with JSON/REST so much. It seems to imply a message bus.
--Michael
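One widely used answer to the boundary problem Michael DeHaan raises is to frame the stream as one JSON document per line (newline-delimited JSON) rather than one big array, so the client can parse each record the moment a full line arrives, with no closing bracket to wait for. A sketch of the client side; this is a general technique, not a feature of the Neo4j REST API under discussion:

```python
import json

stream = b'{"id": 1}\n{"id": 2}\n{"id": 3'  # third record still in flight

records, buffer = [], b""
# Simulate reading the socket in arbitrary 8-byte chunks.
for chunk in (stream[i:i + 8] for i in range(0, len(stream), 8)):
    buffer += chunk
    while b"\n" in buffer:                    # a newline means one complete record
        line, buffer = buffer.split(b"\n", 1)
        records.append(json.loads(line))
# The two complete records are parsed; the partial third stays buffered.
```

The newline plays the role of the record boundary that a single JSON array lacks, which is what makes "read some bytes and work on them" feasible for a plain JSON client.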
Re: [Neo4j] REST results pagination
That would need to hold resources on the server (potentially for an indeterminate amount of time), since it must be stateful. In general, stateful APIs do not scale well in cases of dynamic queries.

- Reply message -
From: Georg Summer georg.sum...@gmail.com
Date: Fri, Apr 22, 2011 1:25 pm
Subject: [Neo4j] REST results pagination
To: Neo4j user discussions user@lists.neo4j.org

> I might be a little newbish here, but then why not an iterator? The iterator lives on the server and is accessible through the REST interface, providing an advance and a value method. It either operates on a stored, once-created, stable result set, or holds the query and evaluates it on demand (issues of the changing underlying graph included). The client can get paginator functionality by advancing and dereferencing the iterator n times, or streaming-like behaviour by constantly pushing the obtained data into a queue and keeping going. If the client does not need the iterator anymore, it simply stops using it, and a timeout eventually kills it on the server. A client-callable delete method for the iterator would work as well.
>
> Georg
Re: [Neo4j] REST results pagination
I'll be happy to host the streaming REST API summit. Ample amounts of beer will be provided. ;-)

- Reply message -
From: Jim Webber j...@neotechnology.com
Date: Fri, Apr 22, 2011 1:46 pm
Subject: [Neo4j] REST results pagination
To: Neo4j user discussions user@lists.neo4j.org

> Hi Georg,
>
> It would at least have to be an iterator over pages; otherwise the results tend to be fine-grained and so horribly inefficient to send over a network.
>
> Jim
Re: [Neo4j] REST results pagination
And you would want to reuse your connection so you don't have to pay this penalty per request.

Just asking: how would such a REST resource iterator look (URI, verbs, request/response formats)? I assume every query (index, traversal) would then just return the iterator URI for later consumption. If we store the query and/or result information (as discussed by Craig and others) at the node returned as the iterator, this would be a nice fit.

M

Sent from my iBrick4

Am 22.04.2011 um 19:46 schrieb Jim Webber j...@neotechnology.com:

> Hi Georg,
>
> It would at least have to be an iterator over pages; otherwise the results tend to be fine-grained and so horribly inefficient to send over a network.
>
> Jim
Re: [Neo4j] REST results pagination
I spent some time looking at what others are doing, for inspiration. I kind of like the Riak/Basho approach with multipart chunks, and the approach of explicitly creating a resource for the query that can be navigated (either via pages or first, next, [prev, last] links) and expires (and could be reconstructed).

Cheers Michael

Good discussion: http://stackoverflow.com/questions/924472/paging-in-a-rest-collection

CouchDB: http://wiki.apache.org/couchdb/HTTP_Document_API - startKey + limit, endKey + limit, sorting, insert/update order

MongoDB: [cursor-id] + batch_size

OrientDB: .../[limit]

Sones: no real REST API, but a SQL on top of the graph: http://developers.sones.de/documentation/graph-query-language/select/ with limit and offset, but also depth (for the graph)

HBase: explicitly creates scanners, which can then be accessed with next operations, and which expire after no activity for a certain timeout

Riak: http://wiki.basho.com/REST-API.html - client-id header for client identification (sticky?); optional query parameters for including properties and for whether to stream the data: keys=[true,false,stream]. If "keys=stream", the response will be transferred using chunked encoding, where each chunk is a JSON object. The first chunk will contain the "props" entry (if props was not set to false). Subsequent chunks will contain individual JSON objects with the "keys" entry containing a sublist of the total keyset (some sublists may be empty).
Riak seems to support partial JSON, non-closed elements: -d '{props:{n_val:5'

It returns multiple responses in one go:

Content-Type: multipart/mixed; boundary=YinLMzyUR9feB17okMytgKsylvh

--YinLMzyUR9feB17okMytgKsylvh
Content-Type: application/x-www-form-urlencoded
Link: /riak/test; rel=up
Etag: 16vic4eU9ny46o4KPiDz1f
Last-Modified: Wed, 10 Mar 2010 18:01:06 GMT

{bar:baz}

(this block can be repeated n times)

--YinLMzyUR9feB17okMytgKsylvh--
* Connection #0 to host 127.0.0.1 left intact
* Closing connection #0

Query results: Content-Type - always multipart/mixed, with a boundary specified.

Understanding the response body: The response body will always be multipart/mixed, with each chunk representing a single phase of the link-walking query. Each phase will also be encoded in multipart/mixed, with each chunk representing a single object that was found. If no objects were found or "keep" was not set on the phase, no chunks will be present in that phase. Objects inside phase results will include Location headers that can be used to determine bucket and key. In fact, you can treat each object-chunk similarly to a complete response from read object, without the status code.
HTTP/1.1 200 OK
Server: MochiWeb/1.1 WebMachine/1.6 (eat around the stinger)
Expires: Wed, 10 Mar 2010 20:24:49 GMT
Date: Wed, 10 Mar 2010 20:14:49 GMT
Content-Type: multipart/mixed; boundary=JZi8W8pB0Z3nO3odw11GUB4LQCN
Content-Length: 970

--JZi8W8pB0Z3nO3odw11GUB4LQCN
Content-Type: multipart/mixed; boundary=OjZ8Km9J5vbsmxtcn1p48J91cJP

--OjZ8Km9J5vbsmxtcn1p48J91cJP
Content-Type: application/json
Etag: 3pvmY35coyWPxh8mh4uBQC
Last-Modified: Wed, 10 Mar 2010 20:14:13 GMT

{riak:CAP}
--OjZ8Km9J5vbsmxtcn1p48J91cJP--

--JZi8W8pB0Z3nO3odw11GUB4LQCN
Content-Type: multipart/mixed; boundary=RJKFlAs9PrdBNfd74HANycvbA8C

--RJKFlAs9PrdBNfd74HANycvbA8C
Location: /riak/test/doc2
Content-Type: application/json
Etag: 6dQBm9oYA1mxRSH0e96l5W
Last-Modified: Wed, 10 Mar 2010 18:11:41 GMT

{foo:bar}
--RJKFlAs9PrdBNfd74HANycvbA8C--
--JZi8W8pB0Z3nO3odw11GUB4LQCN--

* Connection #0 to host 127.0.0.1 left intact
* Closing connection #0

Riak - MapReduce: Optional query parameters: * chunked - when set to true, results will be returned one at a time in multipart/mixed format using chunked encoding. Important headers: * Content-Type - application/json when chunked is not true, otherwise multipart/mixed with application/json parts. Other interesting endpoints: /ping, /stats
Re: [Neo4j] REST results pagination
Legacy applications that just use a new data source. It can be quite hard to get users away from their trusty old-chap UI. In the case of pagination, legacy might only mean some years, but still legacy :-). @1-2) In the wake of mobile applications and mobile sites, a pagination system might be more relevant than bulk loading everything and displaying it. Defining smart filters might be problematic in such a use case as well. Parallelism of an application could also be an interesting aspect: each worker retrieves different pages of the graph, and the user does not have to care at all about separating the graph after downloading it. This would only be interesting, though, if the graph relations are not important. Georg On 21 April 2011 14:59, Rick Bullotta rick.bullo...@thingworx.com wrote: Fwiw, I think paging is an outdated crutch, for a few reasons: 1) bandwidth and browser processing/parsing are largely non-issues, although they used to be 2) human users rarely have the patience (and usability sucks) to go beyond 2-4 pages of information. It is far better to allow incrementally refined filters and searches to get to a workable subset of data. 3) machine users could care less about paging 4) when doing visualization of a large dataset, you generally want the whole dataset, not a page of it, so that's another non use case Discuss and debate please! Rick - Reply message - From: Craig Taverner cr...@amanzi.com Date: Thu, Apr 21, 2011 8:52 am Subject: [Neo4j] REST results pagination To: Neo4j user discussions user@lists.neo4j.org I assume this: Traverser x = Traversal.description().traverse( someNode ); x.nodes(); x.nodes(); // Not necessarily in the same order as previous call. If that assumption is false or there is some workaround, then I agree that this is a valid approach, and a good efficient alternative when sorting is not relevant.
Glancing at the code in TraverserImpl though, it really looks like the call to .nodes will re-run the traversal, and I thought that would mean the two calls can yield results in different order? OK. My assumptions were different. I assume that while the order is not easily predictable, it is reproducible as long as the underlying graph has not changed. If the graph changes, then the order can change also. But I think this is true of a relational database also, is it not? So, obviously pagination is expected (by me at least) to give page X as it is at the time of the request for page X, not at the time of the request for page 1. But my assumptions could be incorrect too... I understand, and completely agree. My problem with the approach is that I think it's harder than it looks at first glance. I guess I cannot argue that point. My original email said I did not know if this idea had been solved yet. Since some of the key people involved in this have not chipped into this discussion, either we are reasonably correct in our ideas, or so wrong that they don't know where to begin correcting us ;-) This is what makes me push for the sorted approach - relational databases are doing this. I don't know how they do it, but they are, and we should be at least as good. Absolutely. We should be as good. Relational databases manage to serve a page deep down the list quite fast. I must believe that if they had to complete the traversal, sort the results and extract the page on every single page request, they could not be so fast. I think my ideas for the traversal are 'supposed' to be performance enhancements, and that is why I like them ;-) I agree the issue of what should be indexed to optimize sorting is a domain-specific problem, but I think that is how relational databases treat it as well. If you want sorting to be fast, you have to tell them to index the field you will be sorting on.
The only difference contra having the user put the sorting index in the graph is that relational databases will handle the indexing for you, saving you a *ton* of work, and I think we should too. Yes. I was discussing automatic indexing with Mattias recently. I think (and hope I am right) that once we move to automatic indexes, it will be possible to put external indexes (à la Lucene) and graph indexes (like the ones I favour) behind the same API. In this case perhaps the database will more easily be able to make the right optimized decisions, and use the index for providing sorted results fast and with a low memory footprint where possible, based on the existence or non-existence of the necessary indices. Then all the developer needs to do to make things really fast is put in the right index. For some data that would be Lucene, and for others it would be a graph index. If we get to this point, I think we will have closed a key usability gap with relational databases. There are cases where you need to add this sort of meta data to your domain model
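To make the index argument concrete, here is a minimal standalone sketch in plain Java. The class and method names are invented for illustration, and a TreeMap stands in for a Lucene or graph index: because the entries are already kept sorted, serving page X is a bounded skip-and-read, not a "traverse everything, sort, then slice" on every request.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class IndexedPages {
    // With entries already kept in sorted order by the index, a page
    // deep down the list only costs a skip plus pageSize reads.
    static List<String> page(TreeMap<String, String> index, int pageNo, int pageSize) {
        List<String> page = new ArrayList<>();
        int skip = pageNo * pageSize;
        for (Map.Entry<String, String> e : index.entrySet()) {
            if (skip-- > 0) continue;       // already sorted: just skip ahead
            page.add(e.getValue());
            if (page.size() == pageSize) break;
        }
        return page;
    }

    public static void main(String[] args) {
        TreeMap<String, String> index = new TreeMap<>();
        for (String name : List.of("carol", "alice", "dave", "bob", "eve")) {
            index.put(name, name);  // insertion order irrelevant: TreeMap keeps keys sorted
        }
        System.out.println(page(index, 1, 2)); // [carol, dave]
    }
}
```

A real implementation would of course not scan from the start at all (a B-tree or skip list can seek straight to the offset), which is exactly the optimization the index is meant to unlock.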
Re: [Neo4j] REST results pagination
3) machine users could care less about paging My thoughts are that parsing very large documents can perform poorly and requires the entire document to be slurped into (available) RAM. This puts a cap on the size of a usable result set and slows processing, or at least makes you pay an up-front cost, and decreases the potential for parallelism in other parts of your app.
Re: [Neo4j] REST results pagination
Fwiw, we use an idiot-resistant (no such thing as idiot-proof) approach that clamps the number of returned items on the server side by default. We allow the user to explicitly request to do something foolish and ask for more data, but it requires a conscious effort. - Reply message - From: Jacob Hansson ja...@voltvoodoo.com Date: Thu, Apr 21, 2011 10:06 am Subject: [Neo4j] REST results pagination To: Neo4j user discussions user@lists.neo4j.org On Thu, Apr 21, 2011 at 2:59 PM, Rick Bullotta rick.bullo...@thingworx.com wrote: Fwiw, I think paging is an outdated crutch, for a few reasons: 1) bandwidth and browser processing/parsing are largely non-issues, although they used to be I disagree. They have improved significantly, for sure, but that is no reason to download massive amounts of data that will never be used. 2) human users rarely have the patience (and usability sucks) to go beyond 2-4 pages of information. It is far better to allow incrementally refined filters and searches to get to a workable subset of data. I agree with the suckiness of paging and the awesomeness of filtering - but what do you do when the user's filter returns 40 million results? You somehow have to tell the user that damn, that filter, it returned forty freaking million results, you need to refine your search, buddy. The way the user expects that to happen is through a paged, infinite-scrolled or similar interface, where she can see how many results were returned and act on that feedback. 3) machine users could care less about paging Agreed, streaming is a much better way for machines to talk about data that doesn't fit in memory. 4) when doing visualization of a large dataset, you generally want the whole dataset, not a page of it, so that's another non use case Not necessarily true. You need all the data that you want to visualize, but that is not necessarily all the data the user has asked for.
You can be clever about the visualization to keep it uncluttered, and paging-like behaviours may be a way to do that. Discuss and debate please! Rick - Reply message - From: Craig Taverner cr...@amanzi.com Date: Thu, Apr 21, 2011 8:52 am Subject: [Neo4j] REST results pagination To: Neo4j user discussions user@lists.neo4j.org
Re: [Neo4j] REST results pagination
Good dialog, btw! - Reply message - From: Jacob Hansson ja...@voltvoodoo.com Date: Thu, Apr 21, 2011 10:06 am Subject: [Neo4j] REST results pagination To: Neo4j user discussions user@lists.neo4j.org
Re: [Neo4j] REST results pagination
This is indeed a good dialogue. Pagination versus streaming was something I'd previously had in my mind as orthogonal issues, but I like the direction this is going. Let's break it down to fundamentals: As a remote client, I want to be just as rich and performant as a local client. Unfortunately, Deutsch, Amdahl and Einstein are against me on that, and I don't think I am tough enough to defeat those guys. So what are my choices? I know I have to be less granular to try to alleviate some of the network penalty, so doing operations in bulk sounds great. Now what I need to decide is whether I control the rate at which those bulk operations occur or whether the server does. If I want to control those operations, then paging seems sensible. Otherwise a streamed (chunked) encoding scheme would make sense if I'm happy for the server to throw results back at me at its own pace. Or indeed you can mix both so that pages are streamed. In either case, if I get bored of those results, I'll stop paging or I'll terminate the connection. So what does this mean for implementation on the server? I guess this is important since it affects the likelihood of the Neo Tech team implementing it. If the server supports pagination, it means we need a paging controller in memory per paginated result set being created. If we assume that we'll only go forward in pages, that's effectively just a wrapper around the traversal that's been uploaded. The overhead should be modest, and apart from the paging controller and the traverser, it doesn't need much state. We would need to add some logic to the representation code to support next links, but that seems a modest task. If the server streams, we will need to decouple the representation generation from the existing representation logic, since that builds an in-memory representation which is then flushed. Instead we'll need a streaming representation implementation, which seems to be a reasonable amount of engineering.
We'll also need a new streaming binding to the REST server in JAX-RS land. I'm still a bit concerned about how rude it is for a client to just drop a streaming connection. I've asked Mark Nottingham for his authoritative opinion on that. But still, this does seem popular and feasible. Jim
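The forward-only "paging controller" Jim describes can be sketched in a few lines of plain Java. All names here are invented for illustration, and a List iterator stands in for the lazy iterator the uploaded traversal would provide; the point is that the controller holds almost no state beyond the iterator and a counter.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical forward-only paging controller: one instance per
// paginated result set, wrapping the traversal's lazy iterator.
class PagingController<T> {
    private final Iterator<T> results;   // e.g. the traverser's iterator
    private int pagesServed = 0;

    PagingController(Iterator<T> results) {
        this.results = results;
    }

    // Pulls at most pageSize items. An empty page signals exhaustion,
    // at which point the server can expire the controller.
    List<T> nextPage(int pageSize) {
        List<T> page = new ArrayList<>();
        while (page.size() < pageSize && results.hasNext()) {
            page.add(results.next());
        }
        if (!page.isEmpty()) pagesServed++;
        return page;
    }

    int pagesServed() {
        return pagesServed;
    }

    public static void main(String[] args) {
        PagingController<Integer> pc =
                new PagingController<>(List.of(1, 2, 3, 4, 5).iterator());
        System.out.println(pc.nextPage(2)); // [1, 2]
        System.out.println(pc.nextPage(2)); // [3, 4]
        System.out.println(pc.nextPage(2)); // [5]
        System.out.println(pc.nextPage(2)); // [] -> exhausted, safe to expire
    }
}
```

Because the iterator is consumed lazily, results beyond the pages actually requested are never materialized, which is what keeps the per-controller overhead modest.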
Re: [Neo4j] REST results pagination
I can only think of a few use cases where losing some of the expected result is OK, for instance if you want to peek at the result. IMHO, paging is, by definition, a peek. Since the client controls when the next page will be requested, it is not possible, or reasonable, to enforce that the complete set of pages (if ever requested) will represent a consistent result set. This is not supported by relational databases either. The result set, and the meaning of a page, can change between requests. So it can, and does, happen that some of the expected result is lost. This is completely different to the streaming result, which I see Jim commented on, and so I might just reply to his mail too :-) I'm waiting for one of those SlapOnTheFingersExceptions that Tobias has been handing out :) My fingers are, as yet, unscathed. The slap can come at any moment! :-) This sounds really cool, would be a great thing to look into! Should you want examples, I have a wiki page on this topic at http://redmine.amanzi.org/wiki/geoptima/Geoptima_Event_Log
Re: [Neo4j] REST results pagination
I think Jim makes a great point about the difference between paging and streaming being client- versus server-controlled. I think there is a related point to be made, and that is that paging does not, and cannot, guarantee a consistent total result set. Since the database can change between page requests, the pages can be inconsistent. It is possible for the same record to appear in two pages, or for a record to be missed. This is certainly how relational databases work in this regard. But in the streaming case, we expect a complete and consistent result set. Unless, of course, the client cuts off the stream. The use case is very different: while paging is about getting a peek at the data, and rarely about paging all the way to the end, streaming is about getting the entire result, but streamed for efficiency. On Thu, Apr 21, 2011 at 5:00 PM, Jim Webber j...@neotechnology.com wrote:
Re: [Neo4j] REST results pagination
Jim, we should schedule a group chat on this topic. - Reply message - From: Jim Webber j...@neotechnology.com Date: Thu, Apr 21, 2011 11:01 am Subject: [Neo4j] REST results pagination To: Neo4j user discussions user@lists.neo4j.org
Re: [Neo4j] REST results pagination
Really cool discussion so far. I would also prefer streaming over paging, as with that approach we can give both ends more of the control they need. The server doesn't have to keep state over a long time (and also implement timeouts and clearing of that state; keeping that state for lots of clients also adds up). The client can decide how much of the result he's interested in, whether it is just 1 entry or 100k, and then just drop the connection. Streaming calls can also have a request timeout, so keeping those open for too long (with no activity) will close them automatically. The server doesn't use up lots of memory for streaming; one could even leverage the laziness of traversers (and indexes) to not even execute/fetch results that are not going to be sent over the wire. This should accommodate every kind of client, from the mobile phone that only lists a few entries to the big machine that can eat a firehose of result data in milliseconds. For this kind of look-ahead support we could (and should) add a possible offset, so that a client can request data (whose order _he_ is sure hasn't changed) by having the server skip the first n entries (so they don't have to be serialized/put on the wire). I also think that this streaming API could already address many of the pain points of the current REST API. Perhaps we even want to provide a streaming interface in both directions, with the client able to, for instance, stream the creation of nodes and relationships and their indexing without restarting a connection for each operation. Whatever comes in this stream could also be processed in one TX (or, with TX tokens embedded in the stream, the client could even control that). The only question posing itself here for me is whether we want to put this on top of the existing REST API or rather create a more concise API/formats for it (with the later option of the format even degrading to binary for high-bandwidth interaction). I'd prefer the latter.
Cheers Michael Am 21.04.2011 um 21:09 schrieb Rick Bullotta: Jim, we should schedule a group chat on this topic. - Reply message - From: Jim Webber j...@neotechnology.com Date: Thu, Apr 21, 2011 11:01 am Subject: [Neo4j] REST results pagination To: Neo4j user discussions user@lists.neo4j.org This is indeed a good dialogue. The pagination versus streaming was something I'd previously had in my mind as orthogonal issues, but I like the direction this is going. Let's break it down to fundamentals: As a remote client, I want to be just as rich and performant as a local client. Unfortunately, Deutsch, Amdahl and Einstein are against me on that, and I don't think I am tough enough to defeat those guys. So what are my choices? I know I have to be more granular to try to alleviate some of the network penalty so doing operations bulkily sounds great. Now what I need to decide is whether I control the rate at which those bulk operations occur or whether the server does. If I want to control those operations, then paging seems sensible. Otherwise a streamed (chunked) encoding scheme would make sense if I'm happy for the server to throw results back at me at its own pace. Or indeed you can mix both so that pages are streamed. In either case if I get bored of those results, I'll stop paging or I'll terminate the connection. So what does this mean for implementation on the server? I guess this is important since it affects the likelihood of the Neo Tech team implementing it. If the server supports pagination, it means we need a paging controller in memory per paginated result set being created. If we assume that we'll only go forward in pages, that's effectively just a wrapper around the traversal that's been uploaded. The overhead should be modest, and apart from the paging controller and the traverser, it doesn't need much state. We would need to add some logic to the representation code to support next links, but that seems a modest task. 
If the server streams, we will need to decouple the representation generation from the existing representation logic since that builds an in-memory representation which is then flushed. Instead we'll need a streaming representation implementation which seems to be a reasonable amount of engineering. We'll also need a new streaming binding to the REST server in JAX-RS land. I'm still a bit concerned about how rude it is for a client to just drop a streaming connection. I've asked Mark Nottingham for his authoritative opinion on that. But still, this does seem popular and feasible. Jim ___ Neo4j mailing list User@lists.neo4j.org https://lists.neo4j.org/mailman/listinfo/user
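The streaming representation Jim describes, as opposed to the current build-in-memory-then-flush approach, can be illustrated as a generator that serializes each result as it is produced. This is a sketch in Python rather than the server's actual JAX-RS code; the point is only that memory use stays constant regardless of result-set size.

```python
import json

def results():
    # Stand-in for a lazy traversal producing many rows.
    for i in range(3):
        yield {"id": i, "name": f"node-{i}"}

def stream_json(rows):
    """Serialize incrementally: one JSON object per chunk, written as it is
    produced, instead of accumulating the whole document and flushing it
    at the end."""
    for row in rows:
        yield json.dumps(row) + "\n"

chunks = list(stream_json(results()))
```

In a real server each chunk would be written straight to the HTTP response (chunked transfer encoding) instead of collected into a list.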
Re: [Neo4j] REST results pagination
Half-baked thoughts from a neo4j newbie hacker type on this topic:

1) I think it is very important, even with modern infrastructures, for the client to be able to optionally throttle the result set it generates with a query as it sees fit, and not just because of client memory and bandwidth limitations. With regular old SQL databases, if you send a careless large query you can chew up significant system resources for significant amounts of time while it is being processed. At a minimum, a rowcount/pagination option allows you to build something into your client which can minimize accidental denial-of-service queries. I'm not sure if it is possible to construct a query against a large Neo4j database that would temporarily cripple it, but it wouldn't surprise me if you could.

2) Sometimes with regular old SQL databases I'll run a sanity-check count() with the query to just return the size of the expected result set before I try to pull it back into my data structure. Many times count() is all I needed anyhow. Does Neo4j have a result set size function? Perhaps a client that really could only handle small result sets could run a count(), and then filter the search somehow, if necessary, until the count() was smaller? (I guess it would depend on the problem domain...) In other words it may be possible, when it is really important, to implement pagination logic on the client side, if you don't mind running multiple queries for each set of data you get back.

3) If the result set was broken into pages, you could organize the pages in the server with a set of [temporary] graph nodes with relationships to the results in the database -- one node for each page, and a parent node for the result set. If the order of the pages is important, you could add directed relationships between the page nodes. 
If the order within the pages is important you could either apply a sequence numbering to the page-result relationship, or add temporary directed result-set relationships too. Subsequent page retrievals would be new traversals based on the search result set graph. In a sense you would be building a temporary graph index, I suppose. An advantage to organizing search result sets this way is that you could then union and intersect result sets (and do other set operations) without a huge memory overhead. (Which means you could probably store millions of search results at one time, and you could persist them through restarts.)

4) In some HA architectures you may have multiple database copies behind a load balancer. Would the search result pages be stored equally on all of them? Would the client require a sticky flag, to always go back to the same specific server instance for more pages? Depending on how fast writes get propagated across the cluster (compared to requests for the next page), if you were creating nodes as described in (3), would that work?

5) As for sorting: in my experience, if I need a result set sorted from a regular SQL database, I will usually sort it myself. Most databases I've ever worked with routinely have performance problems. You can minimize finger pointing, and the risk of complicating those other performance problems, by just directing the database to get me what I need -- I'll do the rest of it back in the client. On the other hand, sometimes it is quicker and easier to let the database do the work. (Usually when I can only handle the data in small chunks on the client.) What I'm trying to say is that I think sorting is going to be more important to clients who want paginated results (i.e., resource-limited clients) than to clients who are grabbing large chunks of data at a time (and will want to own any post-query processing steps anyhow). 
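Rick's page-node idea (a parent result-set node, one node per page, ordered links between pages, sequence numbers within a page) can be modelled roughly as below. This is a plain-Python sketch, not the Neo4j API; a real version would create transient graph nodes and relationships instead of dicts.

```python
def build_result_set(results, page_size):
    """Model of the in-graph paging structure: a parent result-set record,
    one record per page, and a 'next' link giving the page ordering.
    (Sketch only; names and structure are illustrative.)"""
    pages = [results[i:i + page_size] for i in range(0, len(results), page_size)]
    return {
        "result_set": {"page_count": len(pages)},
        "pages": [
            {"seq": n, "items": page, "next": n + 1 if n + 1 < len(pages) else None}
            for n, page in enumerate(pages)
        ],
    }

rs = build_result_set(list(range(7)), page_size=3)
```

Because each page is its own record with explicit links, set operations (union, intersection) can work page-by-page without loading the whole result set, which is the memory advantage Rick points out.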
-- Rick Otten rot...@windfish.net O=='=+
Re: [Neo4j] REST results pagination
Rick, great thoughts. Good catch -- I forgot to add the in-graph representation of the results to my mail, thanks for adding that part. Temporary (transient) nodes and relationships would really rock here, with the advantage that with HA you have them distributed to all cluster nodes. Certainly Craig will have some interesting things to add to this, as those probably resemble his in-graph indexes / R-trees.

As traversers are lazy, a count operation is not so easily possible; you could run the traversal and discard the results. But then the client could also just pull those results until it reaches its internal thresholds, and then decide to use more filtering, or stop the pulling and ask the user for more filtering (you can always retrieve n+1 and show the user that there are more than n results available). The index result size() method only returns an estimate of the result size (which might not contain currently changed index entries). Please don't forget that a count() query in an RDBMS can be as ridiculously expensive as the original query (especially if just the column selection was replaced with count, and sorting, grouping etc. were still left in place together with lots of joins).

Sorting on your own instead of letting the db do it mostly harms performance, as it requires you to build up all the data in memory, sort it, and then use it. If instead the db does the sorting more efficiently and streams the data, you can use it directly from the stream.

Cheers Michael Am 21.04.2011 um 23:04 schrieb Rick Otten: Half-baked thoughts from a neo4j newbie hacker type on this topic: 1) I think it is very important, even with modern infrastructures, for the client to be able to optionally throttle the result set it generates with a query as it sees fit, and not just because of client memory and bandwidth limitations. 
With regular old SQL databases if you send a careless large query, you can chew up significant system resources, for significant amounts of time while it is being processed. At a minimum, a rowcount/pagination option allows you to build something into your client which can minimize accidental denial of service queries. I'm not sure if it is possible to construct a query against a large Neo4j database that would temporarily cripple it, but it wouldn't surprise me if you could. 2) Sometimes with regular old SQL databases I'll run a sanity check count() function with the query to just return the size of the expected result set before I try to pull it back into my data structure. Many times count() is all I needed anyhow. Does Neo4j have a result set size function? Perhaps a client that really could only handle small result sets could run a count(), and then filter the search somehow, if necessary, until the count() was smaller? (I guess it would depend on the problem domain...) In other words it may be possible, when it is really important, to implement pagination logic on the client side, if you don't mind running multiple queries for each set of data you get back. 3) If the result set was broken into pages, you could organize the pages in the server with a set of [temporary] graph nodes with relationships to the results in the database -- one node for each page, and a parent node for the result set. If order of the pages is important, you could add directed relationships between the page nodes. If the order within the pages is important you could either apply a sequence numbering to the page-result relationship, or add directed temporary result set directed relationships too. Subsequent page retrievals would be new traversals based on the search result set graph. In a sense you would be building a temporary graph-index I suppose. 
And advantage to organizing search result sets this way is that you could then union and intersect result sets (and do other set operations) without a huge memory overhead. (Which means you could probably store millions of search results at one time, and you could persist them through restarts.) 4) In some HA architectures you may have multiple database copies behind a load balancer. Would the search result pages be stored equally on all of them? Would the client require a sticky flag, to always go back to the same specific server instance for more pages? Depending on how fast writes get propagated across the cluster (compared to requests for the next page), if you were creating nodes as described in (3) would that work? 5) As for sorting: In my experience, if I need a result set sorted from a regular SQL database, I will usually sort it myself. Most databases I've ever worked with routinely have performance problems. You can minimize finger pointing and the risk of complicating those other performance problems by just directing the database to get me what I need, I'll do the rest of it back in the client. On the other hand, sometimes it
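Michael's n+1 trick from his reply above (fetch one item more than the page size to learn whether more results exist, without running a potentially expensive count()) is tiny but worth spelling out:

```python
import itertools

def fetch_page(source, n):
    """Pull n+1 items from a lazy result source; report whether more results
    exist without counting (or even touching) the rest of the result set."""
    items = list(itertools.islice(source, n + 1))
    has_more = len(items) > n
    return items[:n], has_more

page, more = fetch_page(iter(range(100)), n=10)
```

The `source` here is a plain iterator standing in for a lazy traverser; the same idea works for any streamed result.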
Re: [Neo4j] REST results pagination
Hi Javier, what would you need that for? I'm interested in the use case. Cheers Michael Am 20.04.2011 um 06:17 schrieb Javier de la Rosa: On Tue, Apr 19, 2011 at 10:25, Jim Webber j...@neotechnology.com wrote: I've just checked and that's in our list of stuff we really should do because it annoys us that it's not there. No promises, but we do intend to work through at least some of that list for the 1.4 releases. If this finally is developed, it will possible to request for all nodes and all relationships in some URL? Jim -- Javier de la Rosa http://versae.es ___ Neo4j mailing list User@lists.neo4j.org https://lists.neo4j.org/mailman/listinfo/user
Re: [Neo4j] REST results pagination
Data export, e.g. dumping everything as CSV, DOT or RDF? On 20 April 2011 18:33, Michael Hunger michael.hun...@neotechnology.com wrote: Hi Javier, what would you need that for? I'm interested in the use case. Cheers Michael Am 20.04.2011 um 06:17 schrieb Javier de la Rosa: On Tue, Apr 19, 2011 at 10:25, Jim Webber j...@neotechnology.com wrote: I've just checked and that's in our list of stuff we really should do because it annoys us that it's not there. No promises, but we do intend to work through at least some of that list for the 1.4 releases. If this finally is developed, it will possible to request for all nodes and all relationships in some URL? Jim -- Javier de la Rosa http://versae.es ___ Neo4j mailing list User@lists.neo4j.org https://lists.neo4j.org/mailman/listinfo/user
Re: [Neo4j] REST results pagination
Won't dumping a graph database to a tabular format create a huge file even if the number of nodes is few (for a highly interconnected graph)? On 4/20/2011 2:41 AM, Tim McNamara wrote: Data export, e.g. dumping everything as CSV, DOT or RDF? On 20 April 2011 18:33, Michael Hunger michael.hun...@neotechnology.com wrote: Hi Javier, what would you need that for? I'm interested in the use case. Cheers Michael Am 20.04.2011 um 06:17 schrieb Javier de la Rosa: On Tue, Apr 19, 2011 at 10:25, Jim Webber j...@neotechnology.com wrote: I've just checked and that's in our list of stuff we really should do because it annoys us that it's not there. No promises, but we do intend to work through at least some of that list for the 1.4 releases. If this finally is developed, it will possible to request for all nodes and all relationships in some URL? Jim -- Javier de la Rosa http://versae.es ___ Neo4j mailing list User@lists.neo4j.org https://lists.neo4j.org/mailman/listinfo/user
Re: [Neo4j] REST results pagination
On Tue, Apr 19, 2011 at 10:17 PM, Michael DeHaan michael.deh...@gmail.comwrote: On Tue, Apr 19, 2011 at 10:58 AM, Jim Webber j...@neotechnology.com wrote: I'd like to propose that we put this functionality into the plugin ( https://github.com/skanjila/gremlin-translation-plugin) that Peter and I are currently working on, thoughts? I'm thinking that, if we do it, it should be handled through content negotiation. That is if you ask for application/atom then you get paged lists of results. I don't necessarily think that's a plugin, it's more likely part of the representation logic in server itself. This is something I've been wondering about as I may have the need to feed very large graphs into the system and am wondering how the REST API will hold up compared to the native interface. What happens if the result of an index query (or traversal, whatever) legitimately needs to return 100k results? Wouldn't that be a bit large for one request? If anything, it's a lot of JSON to decode at once. Yeah, we can't do this right now, and implementing it is harder than it seems at first glance, since we first need to implement sorting of results, otherwise the paged result will be useless. Like Jim said though, this is another one of those *must be done* features. Feeds make sense for things that are feed-like, but do atom feeds really make sense for results of very dynamic queries that don't get subscribed to? Or, related question, is there a point where the result sets of operations get so large that things start to break down? What do people find this to generally be? I'm sure there are some awesome content types out there that we can look at that will fit our uses, I don't feel confident to say if Atom is a good choice, I've never worked with it.. The point where this breaks down I'm gonna guess is in server-side serialization, because we currently don't stream the serialized data, but build it up in memory and ship it off when it's done. 
I'd say you'll run out of memory pretty quickly on a small server, which I think underlines how important this is to fix. Maybe it's not an issue, but pointers to any problems REST API usage has with large data sets (and solutions?) would be welcome. Not aware of anyone bumping into these limits yet, but I'm sure we'll start hearing about it. The only current solution I can think of is a server plugin that emulates this, but it would have to sort the result, and I'm afraid it will be hard (probably not impossible, but hard) to implement that in a memory-efficient way that far away from the kernel. You may just end up moving the OutOfMemoryExceptions to the plugin instead of the serialization system. --Michael -- Jacob Hansson Phone: +46 (0) 763503395 Twitter: @jakewins ___ Neo4j mailing list User@lists.neo4j.org https://lists.neo4j.org/mailman/listinfo/user
Re: [Neo4j] REST results pagination
Wow, had I known the number of replies, I would have sent the e-mail much earlier ;) Sorting is a very cool feature. I didn't know the hard-core implications of pagination. The only thing I want is to avoid the overhead of sending thousands of nodes over HTTP in JSON. Actually, a workaround that splits results (using offset and limit) on the server, without a real cut of them, would be enough for me: sending in JSON only a concrete number of results. If this feature could be in the core of Neo4j, perfect :-) On Wed, Apr 20, 2011 at 08:01, Jacob Hansson ja...@voltvoodoo.com wrote: On Wed, Apr 20, 2011 at 11:25 AM, Craig Taverner cr...@amanzi.com wrote: I think sorting would need to be optional, since it is likely to be a performance and memory hog on large traversals. I think one of the key benefits of the traversal framework in the Embedded API is being able to traverse and 'stream' a very large graph without occupying much memory. If this can be achieved in the REST API (through pagination), that is a very good thing. I assume the main challenge is being able to freeze a traverser and keep it on hold between client requests for the next page. Perhaps you have already solved that bit? While I agree with you that the ability to effectively stream the results of a traversal is a very useful thing, I don't like the persisted traverser approach, for several reasons. I'm sorry if my tone below is a bit harsh, I don't mean it that way, I simply want to make a strong case for why I think the hard way is the right way in this case. First, the only good restful approach I can think of for doing persisted traversals would be to create a traversal resource (since it is an object that keeps persistent state), and get back an id to refer to it. Subsequent calls to paged results would then be to that traversal resource, updating its state and getting results back. Assuming this is the correct way to implement this, it comes with a lot of questions. 
Should there be a timeout for these resources, or is the user responsible for removing them from memory? What happens when the server crashes and the client can't find the traversal resources it has ids for? If we somehow solve that or find some better approach, we end up with an API where a client can get paged results, but two clients performing the same traversal on the same data may get back the same result in different order (see my comments on sorting based on expected traversal behaviour below). This means that the API is really only useful if you actually want to get the entire result back. If that was the problem we wanted to solve, a streaming solution is a much easier and faster approach than a paging solution. Second, being able to iterate over the entire result set is only half of the use cases we are looking to solve. The other half are the ones I mentioned examples of (the blog case, presenting lists of things to users and so on), and those are not solved by this. Forcing users of our database to pull out all their data over the wire and sort the whole thing, only to keep the first 10 items, for each user that lands on their frontpage, is not ok. Third, and most importantly to me, using this case to put more pressure on ourselves to implement real sorting is a really good thing. Sorting is something that *really* should be provided by us, anyone who has used a modern database expects this to be our problem to solve. We have a really good starting point for optimizing sorting algorithms, sitting as we are inside the kernel with our caches and indexes :) In my opinion, I would code the sorting as a characteristic of the graph itself, in order to avoid having to sort in the server (and incur the memory/performance hit). So that means I would use a domain-specific solution to sorting. Of course, generic sorting is nice also, but make it optional. I agree sorting should be an opt-in feature. 
Putting meta-data like sorting order and similar things inside the graph I think is a matter of personal preference, and for sure has its place as a useful optimization. I do, however, think that the official approach to sorting needs to be based on concepts familiar from other databases - define your query, and define how you want the result sorted. If indexes are available the database can use them to optimize the sorting, otherwise it will suck, but at least we're doing what the user wants us to do. All lessons learned in YesSQL databases (see what I did there?) should not be unlearned :) Also, the approach of sorting via the traversal itself assumes knowledge of which order the traverser will move through the graph, and that is not necessarily something that will be the same in later releases. Tobias was talking about cache-first traversals as an addition or even a replacement for depth/breadth-first ones, a major optimization we cannot do if we encourage people to sort inside the graph. /Jake
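For Jacob's blog-front-page case (sort server-side and ship only the first 10 items), a bounded heap can produce the first page of a sorted result while holding only `page_size` items in memory, even over a huge lazy result stream. A sketch with made-up data; the row shape and sort key are illustrative only:

```python
import heapq

def first_page_sorted(rows, key, page_size=10):
    """Server-side 'top of the sorted result': scans the lazy result stream
    once, keeps at most page_size candidates in memory, and returns them
    already in sorted order."""
    return heapq.nsmallest(page_size, rows, key=key)

# 1000 fake posts with a pseudo-random timestamp; never materialized as a list.
posts = ({"id": i, "ts": (i * 37) % 101} for i in range(1000))
page = first_page_sorted(posts, key=lambda p: -p["ts"], page_size=10)
```

This gives the newest-first page without pulling the whole result over the wire or fully sorting it, which is exactly the case Jacob argues should not be pushed onto clients.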
Re: [Neo4j] REST results pagination
Here is my motivation for this petition. In my ideal world, everything in the Neo4j REST server that returns a list should be something like a RequestList or QuerySet, and support pagination and even filtering with lookups, so I could do things like the following (sorry for the Python syntax, I'm always thinking of the Python REST client): gdb.nodes.all()[2:5] # Perform the query on the server to get the nodes between the 2nd and 5th position; we assume Neo always returns ordered results in the same way. gdb.nodes.filter(name__contains="neo")[:10] # Returns only nodes with a property called name which contains neo, and returns only the first 10. This is important for the integration of the Neo4j Python REST Client in Django, because I'm currently developing an application with lazy and user-defined schemas on top of Django and Neo4j. The listing of nodes and relationships is a requirement for me, so pagination is a must in my application. Performing this in the application layer instead of on the Neo4j server side wastes a lot of time sending information via REST. On Wed, Apr 20, 2011 at 03:43, Michael Hunger michael.hun...@neotechnology.com wrote: But wouldn't that really custom operation be more easily and much faster done as a server plugin? Otherwise all the data would have to be serialized to JSON and deserialized again, and no streaming would be possible. From a server extension you could even stream and gzip that data with ease. Cheers Michael Am 20.04.2011 um 08:41 schrieb Tim McNamara: Data export, e.g. dumping everything as CSV, DOT or RDF? On 20 April 2011 18:33, Michael Hunger michael.hun...@neotechnology.com wrote: Hi Javier, what would you need that for? I'm interested in the use case. Cheers Michael Am 20.04.2011 um 06:17 schrieb Javier de la Rosa: On Tue, Apr 19, 2011 at 10:25, Jim Webber j...@neotechnology.com wrote: I've just checked and that's in our list of stuff we really should do because it annoys us that it's not there. 
No promises, but we do intend to work through at least some of that list for the 1.4 releases. If this finally is developed, it will possible to request for all nodes and all relationships in some URL? Jim -- Javier de la Rosa http://versae.es ___ Neo4j mailing list User@lists.neo4j.org https://lists.neo4j.org/mailman/listinfo/user
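Javier's gdb.nodes.all()[2:5] idea amounts to translating Python slices into offset/limit parameters, so only the requested window crosses the wire. Below is a sketch of such a lazy query set; `NodeQuerySet` and `fake_fetch` are hypothetical names, not the actual Python REST client API:

```python
class NodeQuerySet:
    """Sketch of a lazy query set: slicing maps onto offset/limit on the
    server side. (Negative indices and steps are omitted for brevity.)"""

    def __init__(self, fetch):
        self._fetch = fetch  # fetch(offset, limit) -> list of results

    def __getitem__(self, key):
        if isinstance(key, slice):
            offset = key.start or 0
            limit = None if key.stop is None else key.stop - offset
            return self._fetch(offset, limit)
        return self._fetch(key, 1)[0]

# Fake in-memory store standing in for the REST endpoint:
store = [{"id": i} for i in range(20)]

def fake_fetch(offset, limit):
    end = None if limit is None else offset + limit
    return store[offset:end]

nodes = NodeQuerySet(fake_fetch)
window = nodes[2:5]   # only 3 items are requested from the "server"
```

The design choice is that the slice never triggers a full listing; the client stays oblivious to how large the underlying node set is.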
Re: [Neo4j] REST results pagination
This is important for the integration of the Neo4j Python Rest Client in Django, because I'm currently developing an application with lazy and user-defined schemas on top of Django and Neo4j. The listing of nodes and relationships is a requirement for me, so the pagination is a must in my application. Performing this in the application layer instead of Neo4j server side wastes a lot of time sending information via REST. Well put about the listing of nodes and relationships. That's the use case where this comes up. If I can't trust that my app's code indexed something correctly, or I need to index old data later, I may need to walk the whole graph to update the indexes, so large result sets become scary. I don't think I can rely on a traverse, as parts of the graph might be disjoint. New use cases on old data mean we'll have to do that, just like adding a new index to a SQL db. Or if I have an index that lists all nodes of a type, that result set could get very large. In fact, I probably need to access all nodes in order to apply any new indexes, if I can't just send a reindexing command that says "for all nodes, add to the index like so", etc. If I'm understanding the server plugin thing correctly, I've got to go write some Java classes to do that... which, while I *can* do, it would be better if it could be accessed in a language-agnostic way, with something more or less resembling a database cursor (see MongoDB's API). --Michael ___ Neo4j mailing list User@lists.neo4j.org https://lists.neo4j.org/mailman/listinfo/user
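Michael's walk-the-whole-graph-to-reindex case is exactly what a cursor-style API would serve: pull fixed-size batches until the server runs dry, so no single response has to carry 100k results. A sketch against a hypothetical offset/limit endpoint (the `fetch_batch` callable stands in for one REST round trip):

```python
def walk_all(fetch_batch, batch_size=500):
    """Cursor-style iteration: repeatedly ask for the next batch until the
    server returns fewer than batch_size items (hypothetical paging API)."""
    offset = 0
    while True:
        batch = fetch_batch(offset, batch_size)
        for item in batch:
            yield item
        if len(batch) < batch_size:
            return
        offset += batch_size

# Fake "graph" of 1234 nodes standing in for the server:
data = list(range(1234))

def fetch_batch(offset, limit):
    return data[offset:offset + limit]

seen = list(walk_all(fetch_batch, batch_size=500))
```

A reindexing job would replace the `for item in batch` body with the index update, keeping client memory bounded by one batch.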
Re: [Neo4j] REST results pagination
To respond to your arguments it would be worth noting a comment by Michael DeHaan later on in this thread. He asked for 'something more or less resembling a database cursor (see MongoDB's API).' The trick is to achieve this without having to store a lot of state on the server, so it is robust against server restarts or crashes. If we compare to the SQL situation, there are two numbers passed by the client, the page size and the offset. The state can be re-created by the database server entirely from this information. How this is implemented in a relational database I do not know, but whether the database is relational or a graph, certain behaviors would be expected, like robustness against database content changes between the requests, and coping with very long gaps between requests. In my opinion the database cursor could be achieved by both of the following approaches: - Starting the traversal from the beginning, and only returning results after passing the cursor offset position - Keeping a live traverser in the server, and continuing it from the previous position Personally I think the second approach is simply a performance optimization of the first. So robustness is achieved by having both, with the second one working when possible (no server restarts, timeout not expiring, etc.), and falling back to the first in other cases. This achieves performance and robustness. What we do not need to do with either case is keep an entire result set in memory between client requests. Now when you add sorting into the picture, then you need to generate the complete result-set in memory, sort, paginate and return only the requested page. If the entire process has to be repeated for every page requested, this could perform very badly for large result sets. I must believe that relational databases do not do this (but I do not know how they paginate sorted results, unless the sort order is maintained in an index). 
To avoid keeping everything in memory, or repeatedly reloading everything to memory on every page request, we need sorted results to be produced on the stream. This can be done by keeping the sort order in an index. This is very hard to do in a generic way, which is why I thought it best done in a domain specific way. Finally, I think we are really looking at two different but valid use cases. The need for generic sorting combined with pagination, and the need for pagination on very large result sets. The former use case can work with re-traversing and sorting on each client request, is fully generic, but will perform badly on large result sets. The latter can perform adequately on large result sets, as long as you do not need to sort (and use the database cursor approach to avoid loading the result set into memory). On Wed, Apr 20, 2011 at 2:01 PM, Jacob Hansson ja...@voltvoodoo.com wrote: On Wed, Apr 20, 2011 at 11:25 AM, Craig Taverner cr...@amanzi.com wrote: I think sorting would need to be optional, since it is likely to be a performance and memory hog on large traversals. I think one of the key benefits of the traversal framework in the Embedded API is being able to traverse and 'stream' a very large graph without occupying much memory. If this can be achieved in the REST API (through pagination), that is a very good thing. I assume the main challenge is being able to freeze a traverser and keep it on hold between client requests for the next page. Perhaps you have already solved that bit? While I agree with you that the ability to effectively stream the results of a traversal is a very useful thing, I don't like the persisted traverser approach, for several reasons. I'm sorry if my tone below is a bit harsh, I don't mean it that way, I simply want to make a strong case for why I think the hard way is the right way in this case. 
First, the only good restful approach I can think of for doing persisted traversals would be to create a traversal resource (since it is an object that keeps persistent state), and get back an id to refer to it. Subsequent calls to paged results would then be to that traversal resource, updating its state and getting results back. Assuming this is the correct way to implement this, it comes with a lot of questions. Should there be a timeout for these resources, or is the user responsible for removing them from memory? What happens when the server crashes and the client can't find the traversal resources it has ids for? If we somehow solve that or find some better approach, we end up with an API where a client can get paged results, but two clients performing the same traversal on the same data may get back the same result in different order (see my comments on sorting based on expected traversal behaviour below). This means that the API is really only useful if you actually want to get the entire result back. If that was the problem we wanted to solve, a streaming solution is a much easier and faster
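Craig's first approach (recreate the traversal on every request and skip past the offset) needs no server-side state at all, which is what makes it robust against restarts and crashes; the cost is re-walking the skipped prefix on each page. A sketch, with a generator standing in for a freshly created traverser:

```python
import itertools

def get_page(make_traverser, offset, page_size):
    """Stateless 'cursor': no server-side state survives between requests.
    Each call re-runs the traversal and lazily skips the first `offset`
    results, so a server restart between pages loses nothing."""
    return list(itertools.islice(make_traverser(), offset, offset + page_size))

# A fresh (infinite) traverser per request; results named node-0, node-1, ...
make_traverser = lambda: (f"node-{i}" for i in itertools.count())
page3 = get_page(make_traverser, offset=20, page_size=10)
```

Craig's second approach (a live traverser kept on the server) then becomes a pure optimization layered on this: use it when it is still around, fall back to re-traversal otherwise.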
Re: [Neo4j] REST results pagination
Hi Javier, I've just checked and that's in our list of stuff we really should do because it annoys us that it's not there. No promises, but we do intend to work through at least some of that list for the 1.4 releases. Jim ___ Neo4j mailing list User@lists.neo4j.org https://lists.neo4j.org/mailman/listinfo/user
Re: [Neo4j] REST results pagination
I'd like to propose that we put this functionality into the plugin (https://github.com/skanjila/gremlin-translation-plugin) that Peter and I are currently working on, thoughts? From: j...@neotechnology.com Date: Tue, 19 Apr 2011 15:25:20 +0100 To: user@lists.neo4j.org Subject: Re: [Neo4j] REST results pagination Hi Javier, I've just checked and that's in our list of stuff we really should do because it annoys us that it's not there. No promises, but we do intend to work through at least some of that list for the 1.4 releases. Jim ___ Neo4j mailing list User@lists.neo4j.org https://lists.neo4j.org/mailman/listinfo/user
Re: [Neo4j] REST results pagination
On Tue, Apr 19, 2011 at 10:32, Saikat Kanjilal sxk1...@hotmail.com wrote: I'd like to propose that we put this functionality into the plugin (https://github.com/skanjila/gremlin-translation-plugin) that Peter and I are currently working on, thoughts? +1 From: j...@neotechnology.com I've just checked and that's in our list of stuff we really should do because it annoys us that it's not there. No promises, but we do intend to work through at least some of that list for the 1.4 releases. It will be great to see the feature in the 1.4 :-) -- Javier de la Rosa http://versae.es ___ Neo4j mailing list User@lists.neo4j.org https://lists.neo4j.org/mailman/listinfo/user
Re: [Neo4j] REST results pagination
I'd like to propose that we put this functionality into the plugin (https://github.com/skanjila/gremlin-translation-plugin) that Peter and I are currently working on, thoughts? I'm thinking that, if we do it, it should be handled through content negotiation. That is if you ask for application/atom then you get paged lists of results. I don't necessarily think that's a plugin, it's more likely part of the representation logic in server itself. Jim ___ Neo4j mailing list User@lists.neo4j.org https://lists.neo4j.org/mailman/listinfo/user
Re: [Neo4j] REST results pagination
On Tue, Apr 19, 2011 at 10:25, Jim Webber j...@neotechnology.com wrote: I've just checked and that's in our list of stuff we really should do because it annoys us that it's not there. No promises, but we do intend to work through at least some of that list for the 1.4 releases. If this finally gets developed, will it be possible to request all nodes and all relationships at some URL? Jim -- Javier de la Rosa http://versae.es ___ Neo4j mailing list User@lists.neo4j.org https://lists.neo4j.org/mailman/listinfo/user