Re: [Neo4j] REST results pagination

2011-04-26 Thread Jacob Hansson
Here at 49 responses, I'd like to reiterate Craig's point from earlier that we are
really talking about several separate issues, and I'm wondering if we should
split this discussion up, because it is getting very hard to follow.

As I see it, we are looking at three things:

Paging
Use case: UI code that presents a paged, infinite-scrolled or similar
interface to the user. Peeking at results for debugging or other purposes.

Streaming
Use case: Returning huge data sets without killing anyone.

Sorting
Use case: Presenting lists of things to users, and applications that care
about the order of results for some other reason.


I think we've come to agree that streaming and paging serve similar but
different purposes and are not quite able to replace each other's
functionality.

I also took the liberty of elevating sorting to its own topic, because I
believe it should be a generic thing you can do on a result set, whereas
paging and streaming are different means of returning a result set.



If we want to continue this discussion, would anyone object to splitting it
into these three parts?

/Jake

On Sun, Apr 24, 2011 at 2:18 AM, Rick Bullotta
rick.bullo...@thingworx.com wrote:

 Let's discuss sometime soon.  Creating resources that need to be cached or
 saved in session state brings with it a whole bunch of negative aspects...



 - Reply message -
 From: Michael Hunger michael.hun...@neotechnology.com
 Date: Fri, Apr 22, 2011 10:57 pm
 Subject: [Neo4j] REST results pagination
 To: Neo4j user discussions user@lists.neo4j.org

 I spent some time looking at what others are doing for inspiration.

 I kind of like the Riak/Basho approach with multipart chunks and the
 approach of explicitly creating a resource for the query that can be
 navigated (either via pages or first,next,[prev,last] links) and expires
 (and could be reconstructed).

 Cheers

 Michael

 Good discussion:
 http://stackoverflow.com/questions/924472/paging-in-a-rest-collection

 CouchDB:
 http://wiki.apache.org/couchdb/HTTP_Document_API
 startKey + limit, endKey + limit, sorting, insert/update order

 MongoDB: [cursor-id]+batch_size


 OrientDB: .../[limit]

 Sones: no real REST API, but SQL on top of the graph:
 http://developers.sones.de/documentation/graph-query-language/select/
 with limit, offset, but also depth (for graph)

 HBase explicitly creates scanners, which can then be accessed with next
 operations, and which expire after a period of inactivity


 riak:
 http://wiki.basho.com/REST-API.html
 client-id header for client identification - sticky?
 optional query parameters for including properties, and whether to stream the
 data: keys=[true,false,stream]

 If “keys=stream”, the response will be transferred using chunked-encoding,
 where each chunk is a JSON object. The first chunk will contain the “props”
 entry (if props was not set to false). Subsequent chunks will contain
 individual JSON objects with the “keys” entry containing a sublist of the
 total keyset (some sublists may be empty).
 Riak seems to support partial JSON (non-closed elements): -d
 '{"props":{"n_val":5'

 returns multiple responses in one go, Content-Type: multipart/mixed;
 boundary=YinLMzyUR9feB17okMytgKsylvh

 --YinLMzyUR9feB17okMytgKsylvh
 Content-Type: application/x-www-form-urlencoded
 Link: </riak/test>; rel="up"
 Etag: 16vic4eU9ny46o4KPiDz1f
 Last-Modified: Wed, 10 Mar 2010 18:01:06 GMT

 {"bar":"baz"}
 (this block can be repeated n times)
 --YinLMzyUR9feB17okMytgKsylvh--
 * Connection #0 to host 127.0.0.1 left intact
 * Closing connection #0

 Query results:
 Content-Type – always multipart/mixed, with a boundary specified
 Understanding the response body

 The response body will always be multipart/mixed, with each chunk
 representing a single phase of the link-walking query. Each phase will also
 be encoded in multipart/mixed, with each chunk representing a single object
 that was found. If no objects were found or “keep” was not set on the phase,
 no chunks will be present in that phase. Objects inside phase results will
 include Location headers that can be used to determine bucket and key. In
 fact, you can treat each object-chunk similarly to a complete response from
 read object, without the status code.
  HTTP/1.1 200 OK
  Server: MochiWeb/1.1 WebMachine/1.6 (eat around the stinger)
  Expires: Wed, 10 Mar 2010 20:24:49 GMT
  Date: Wed, 10 Mar 2010 20:14:49 GMT
  Content-Type: multipart/mixed; boundary=JZi8W8pB0Z3nO3odw11GUB4LQCN
  Content-Length: 970
 

 --JZi8W8pB0Z3nO3odw11GUB4LQCN
 Content-Type: multipart/mixed; boundary=OjZ8Km9J5vbsmxtcn1p48J91cJP

 --OjZ8Km9J5vbsmxtcn1p48J91cJP
 Content-Type: application/json
 Etag: 3pvmY35coyWPxh8mh4uBQC
 Last-Modified: Wed, 10 Mar 2010 20:14:13 GMT

 {"riak":"CAP"}
 --OjZ8Km9J5vbsmxtcn1p48J91cJP--

 --JZi8W8pB0Z3nO3odw11GUB4LQCN
 Content-Type: multipart/mixed; boundary=RJKFlAs9PrdBNfd74HANycvbA8C

 --RJKFlAs9PrdBNfd74HANycvbA8C
 Location: /riak/test/doc2
 Content-Type: application/json
 Etag

Re: [Neo4j] REST results pagination

2011-04-26 Thread Jim Webber
In addition to what Jake just said about splitting this thread apart, I'd like 
to bring up what Rick suggested about getting together to thrash this out.

Can you guys think about when we might want a Skype call for this? We have to 
take into account timezones to cover from CET through to PST (unless anyone is 
even further east?).

Jim


Re: [Neo4j] REST results pagination

2011-04-26 Thread Justin Cormack
On Fri, 2011-04-22 at 17:43 +0100, Jim Webber wrote:
 Hi Michael,
 
  Just in case we're not talking about the same kind of streaming --
  when I think streaming, I think streaming uploads, streaming
  downloads, etc.
 
 I'm thinking chunked transfers. That is the server starts sending a 
 response and then eventually terminates it when the whole response has been 
 sent to the client.
 
 Although it seems a bit rude, the client could simply opt to close the 
 connection when it's read enough providing what it has read makes sense. 
 Sometimes document fragments can make sense:

 In this case we certainly don't have well-formed XML, but some streaming API 
 (e.g. StAX) might already have been able to create some local objects on the 
 client side as the Earth and Mars nodes came in.
 
 I don't think this is elegant at all, but it might be practical. I've asked 
 Mark Nottingham for his view on this since he's pretty sensible about Web 
 things.

Any intermediate proxies would have to cache the whole thing; many
proxies are not designed for streaming responses, so they might read the whole
thing before relaying it (although they seem to be getting a bit better
at this with video over HTTP). So the server would probably end up
generating the whole thing if there was a proxy in the path.

I think it's workable, but I'm not sure it is ideal...

Justin




Re: [Neo4j] REST results pagination

2011-04-23 Thread Rick Bullotta
Let's discuss sometime soon.  Creating resources that need to be cached or 
saved in session state brings with it a whole bunch of negative aspects...



- Reply message -
From: Michael Hunger michael.hun...@neotechnology.com
Date: Fri, Apr 22, 2011 10:57 pm
Subject: [Neo4j] REST results pagination
To: Neo4j user discussions user@lists.neo4j.org

I spent some time looking at what others are doing for inspiration.

I kind of like the Riak/Basho approach with multipart chunks and the approach 
of explicitly creating a resource for the query that can be navigated (either 
via pages or first,next,[prev,last] links) and expires (and could be 
reconstructed).

Cheers

Michael

Good discussion:
http://stackoverflow.com/questions/924472/paging-in-a-rest-collection

CouchDB:
http://wiki.apache.org/couchdb/HTTP_Document_API
startKey + limit, endKey + limit, sorting, insert/update order

MongoDB: [cursor-id]+batch_size


OrientDB: .../[limit]

Sones: no real REST API, but SQL on top of the graph: 
http://developers.sones.de/documentation/graph-query-language/select/
with limit, offset, but also depth (for graph)

HBase explicitly creates scanners, which can then be accessed with next 
operations, and which expire after a period of inactivity


riak:
http://wiki.basho.com/REST-API.html
 client-id header for client identification - sticky?
optional query parameters for including properties, and whether to stream the data: 
keys=[true,false,stream]

If “keys=stream”, the response will be transferred using chunked-encoding, 
where each chunk is a JSON object. The first chunk will contain the “props” 
entry (if props was not set to false). Subsequent chunks will contain 
individual JSON objects with the “keys” entry containing a sublist of the total 
keyset (some sublists may be empty).
Riak seems to support partial JSON (non-closed elements): -d 
'{"props":{"n_val":5'

returns multiple responses in one go, Content-Type: multipart/mixed; 
boundary=YinLMzyUR9feB17okMytgKsylvh

--YinLMzyUR9feB17okMytgKsylvh
Content-Type: application/x-www-form-urlencoded
Link: </riak/test>; rel="up"
Etag: 16vic4eU9ny46o4KPiDz1f
Last-Modified: Wed, 10 Mar 2010 18:01:06 GMT

{"bar":"baz"}
(this block can be repeated n times)
--YinLMzyUR9feB17okMytgKsylvh--
* Connection #0 to host 127.0.0.1 left intact
* Closing connection #0

Query results:
Content-Type – always multipart/mixed, with a boundary specified
Understanding the response body

The response body will always be multipart/mixed, with each chunk representing 
a single phase of the link-walking query. Each phase will also be encoded in 
multipart/mixed, with each chunk representing a single object that was found. 
If no objects were found or “keep” was not set on the phase, no chunks will be 
present in that phase. Objects inside phase results will include Location 
headers that can be used to determine bucket and key. In fact, you can treat 
each object-chunk similarly to a complete response from read object, without 
the status code.
 HTTP/1.1 200 OK
 Server: MochiWeb/1.1 WebMachine/1.6 (eat around the stinger)
 Expires: Wed, 10 Mar 2010 20:24:49 GMT
 Date: Wed, 10 Mar 2010 20:14:49 GMT
 Content-Type: multipart/mixed; boundary=JZi8W8pB0Z3nO3odw11GUB4LQCN
 Content-Length: 970


--JZi8W8pB0Z3nO3odw11GUB4LQCN
Content-Type: multipart/mixed; boundary=OjZ8Km9J5vbsmxtcn1p48J91cJP

--OjZ8Km9J5vbsmxtcn1p48J91cJP
Content-Type: application/json
Etag: 3pvmY35coyWPxh8mh4uBQC
Last-Modified: Wed, 10 Mar 2010 20:14:13 GMT

{"riak":"CAP"}
--OjZ8Km9J5vbsmxtcn1p48J91cJP--

--JZi8W8pB0Z3nO3odw11GUB4LQCN
Content-Type: multipart/mixed; boundary=RJKFlAs9PrdBNfd74HANycvbA8C

--RJKFlAs9PrdBNfd74HANycvbA8C
Location: /riak/test/doc2
Content-Type: application/json
Etag: 6dQBm9oYA1mxRSH0e96l5W
Last-Modified: Wed, 10 Mar 2010 18:11:41 GMT

{"foo":"bar"}
--RJKFlAs9PrdBNfd74HANycvbA8C--

--JZi8W8pB0Z3nO3odw11GUB4LQCN--
* Connection #0 to host 127.0.0.1 left intact
* Closing connection #0

Riak - MapReduce:
Optional query parameters:

* chunked – when set to true, results will be returned one at a time in 
multipart/mixed format using chunked-encoding.
Important headers:

* Content-Type – application/json when chunked is not true, 
otherwise multipart/mixed with application/json parts

Other interesting endpoints: /ping, /stats


Re: [Neo4j] REST results pagination

2011-04-22 Thread Craig Taverner

 Good catch, forgot to add the in-graph representation of the results to my
 mail, thanks for adding that part. Temporary (transient) nodes and
 relationships would really rock here, with the advantage that with HA you
 have them distributed to all cluster nodes.
 Certainly Craig has to add some interesting things to this, as those
 resemble probably his in graph indexes / R-Trees.


I certainly make use of this model, much more so for my statistical analysis
than for graph indexes (but I'm planning to merge indexes and statistics).

However, in my case the structures are currently very domain specific. But I
think the idea is sound and should be generalizable. What I do is have a
concept of a 'dataset' on which queries can be performed. The dataset is
usually the root of a large sub-graph. The query parser (domain specific)
creates a hashcode of the query, checks if the dataset node already has a
resultset (as a connected sub-graph with its own root node containing the
previous query hashcode), and if so returns that (traverses it); otherwise it
performs the complete dataset traversal, creating the resultset as a new
subgraph, and then returns it. This works well specifically for statistical
queries, where the resultset is much smaller than the dataset, so adding new
subgraphs has a small impact on the database size, and the resultset is much
faster to return; this is a performance enhancement for multiple requests
from the client. Also, I keep the resultset permanently, not temporarily.
Very few operations modify the dataset, and if they do, we delete all
resultsets, and they get re-created the next time. My work on merging the
indexes with the statistics is also planned to only recreate 'dirty' subsets
of the result-set, so modifying the dataset has minimal impact on the query
performance.
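
A minimal sketch of that flow against the Neo4j Java API (the RESULT_SET
relationship type, the queryHash property and the buildResultSet helper are
invented names here; as noted above, the real code is domain specific):

import org.neo4j.graphdb.Direction;
import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Relationship;
import org.neo4j.graphdb.RelationshipType;
import org.neo4j.graphdb.Transaction;

public class ResultSetCache
{
    // Invented name: links a dataset root to a cached resultset root.
    private static final RelationshipType RESULT_SET =
            DynamicRelationshipType.withName( "RESULT_SET" );

    public Node resultsFor( GraphDatabaseService db, Node dataset, String query )
    {
        Integer hash = query.hashCode();
        // Cache hit: a resultset subgraph tagged with the same query hash.
        for ( Relationship rel : dataset.getRelationships( RESULT_SET, Direction.OUTGOING ) )
        {
            Node root = rel.getEndNode();
            if ( hash.equals( root.getProperty( "queryHash", null ) ) )
            {
                return root; // traverse this subgraph instead of the dataset
            }
        }
        // Cache miss: run the full traversal once, persist the result subgraph.
        Transaction tx = db.beginTx();
        try
        {
            Node root = db.createNode();
            root.setProperty( "queryHash", hash );
            dataset.createRelationshipTo( root, RESULT_SET );
            buildResultSet( dataset, root, query ); // domain-specific traversal
            tx.success();
            return root;
        }
        finally
        {
            tx.finish();
        }
    }

    private void buildResultSet( Node dataset, Node root, String query )
    {
        // Domain specific: traverse the dataset, attach result nodes to root.
    }
}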

After reading Rick's previous email I started thinking of approaches to
generalizing this, but I think your 'transient' nodes perhaps encompass
everything I thought about. Here is an idea:

   - Have new nodes/relations/properties tables on disk, like a second graph
   database, but different in the sense that it has one-way relations into the
   main database, which cannot be seen by the main graph and so are by
   definition not part of the graph. These can have transience and expiry
   characteristics. Then we can build the resultset graphs as transient graphs
   in the transient database, with 'drill-down' capabilities to the original
   graph (something I find I always need for statistical queries, and something
   a graph is simply much better at than a relational database).
   - Use some kind of hashcode in the traversal definition or query to
   identify existing, cached, transient graphs in the second database, so you
   can rely on those for repeated queries, or pagination or streaming, etc.

As traversers are lazy, a count operation is not so easily possible; you
 could run the traversal and discard the results. But then the client could
 also just pull those results until it reaches its
 internal thresholds and then decide to use more filtering, or stop the
 pulling and ask the user for more filtering (you can always retrieve n+1 and
 show the user that there are more than n results available).


Yes. Count needs to perform the traversal. So the only way to not have to
traverse twice is to keep a cache. If we make the cache a transient
sub-graph (possibly in the second database I described above), then we have
the interesting behaviour that count() takes a while, but subsequent
queries, pagination or streaming, are fast.

Please don't forget that a count() query in an RDBMS can be as ridiculously
 expensive as the original query (especially if just the column selection was
 replaced with count, and sorting, grouping etc. was still left in place
 together with lots of joins).


Good to hear they have the same problem as us :-)
(or even more problems)

Sorting on your own instead of letting the db do that mostly harms the
 performance as it requires you to build up all the data in memory, sort it
 and then use it. Instead of having the db do that more efficiently, stream
 the data and you can use it directly from the stream.


Client side sorting makes sense if you know the domain well enough to know,
for example, you will receive a small enough result set to 'fit' in the
client, and want to give the user multiple interactive sort options without
hitting the database again. But I agree that in general it makes sense to
get the database to do the sort.

Cheers, Craig


Re: [Neo4j] REST results pagination

2011-04-22 Thread Rick Otten
 Client side sorting makes sense if you know the domain well enough to
 know, for example, you will receive a small enough result set to 'fit'
 in the client, and want to give the user multiple interactive sort
 options without hitting the database again. But I agree that in general
it makes sense to get the database to do the sort.


I'll concede this point.  In general it should be better to do the sorts
on the database server, which is typically by design a hefty backend
system that is optimized for that sort of processing.

In my experience with regular SQL databases, unfortunately they typically
only scale vertically, and are usually running on expensive
enterprise-grade hardware.  Most of the ones I've worked with either run on
minimally sized hardware or have quickly outgrown their hardware.

So they are always either:
  1) Currently suffering from a capacity problem.
  2) Just recovering from a capacity problem.
  3) Heading rapidly towards a new capacity problem.

The next problem I run into is a political, rather than technical one. 
The database administration team is often a different group of people from
the appserver/front end development team.   The guys writing the queries
are usually closer to the appserver than the database.  In other words, it
is easier for them to manage a problem in the appserver, than it is to
manage a problem in the database.

So, instead of having a deep well of data processing power to draw on, and
then using a wide layer of thin commodity hardware presentation layer
servers, we end up transferring data processing power out of the data
server and into the presentation layer.

As we evolve into building data processing systems which can scale
horizontally on commodity hardware, the perpetual capacity problems the
legacy vertical databases suffer from may wane, finally freeing the other
layers from having to pick up some of the slack.


-- 
Rick Otten
rot...@windfish.net
O=='=+




Re: [Neo4j] REST results pagination

2011-04-22 Thread Tobias Ivarsson
On Thu, Apr 21, 2011 at 11:18 PM, Michael Hunger 
michael.hun...@neotechnology.com wrote:

 Rick,

 great thoughts.

 Good catch, forgot to add the in-graph representation of the results to my
 mail, thanks for adding that part. Temporary (transient) nodes and
 relationships would really rock here, with the advantage that with HA you
 have them distributed to all cluster nodes.
 Certainly Craig has to add some interesting things to this, as those
 resemble probably his in graph indexes / R-Trees.

 As traversers are lazy, a count operation is not so easily possible; you
 could run the traversal and discard the results. But then the client could
 also just pull those results until it reaches its
 internal thresholds and then decide to use more filtering, or stop the
 pulling and ask the user for more filtering (you can always retrieve n+1 and
 show the user that there are more than n results available).

 The index result size() method only returns an estimate of the result size
 (which might not contain currently changed index entries).

 Please don't forget that a count() query in an RDBMS can be as ridiculously
 expensive as the original query (especially if just the column selection was
 replaced with count, and sorting, grouping etc. was still left in place
 together with lots of joins).

 Sorting on your own instead of letting the db do that mostly harms the
 performance as it requires you to build up all the data in memory, sort it
 and then use it. Instead of having the db do that more efficiently, stream
 the data and you can use it directly from the stream.


throw new SlapOnTheFingersException( "sometimes the application developer can
do a better job since she has better knowledge of the data; the database
only has generic knowledge" );

Since Jake had already mentioned (in this very thread) that he expected one
of those, I thought I might as well throw one in there.

I agree with the analysis of count(): as the name implies, it will
have to run the entire query in order to count the number of resulting
items.

About sorting I'm torn. The perception that sorting in the database is slow,
which Rick points to, is one that I've seen a lot. When you hand the
responsibility for sorting to the database you hide the fact that sorting is
an expensive operation; it does require reading in all data in order to sort
it. People often expect databases to be smarter than that, since they
sometimes are, but that is pretty much only when reading straight from an
index and not doing much more. A generic sort of data can never be better
than O(log(n!)) [O(log(n!)) is almost equal to, and commonly rounded to, the
easier-to-compute function O(n log(n))]. If you put the responsibility for
sorting in the hands of the application you can sometimes utilize knowledge
about the data to do more efficient sorting than the database could have
done, most often by simply doing an application-level filtering of the data
before sorting it, based on some filtering that could not be transferred to
the database query. This does however make the work of the application
developer slightly more tedious, which is why I think it would be sensible
to have support for sorting on the database level, and hope that users will
be sensible about using it, and not assume magic from it.
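
For reference, the bracketed equivalence is Stirling's approximation:

  \log(n!) = \sum_{k=1}^{n} \log k = n \log n - n + O(\log n)

so O(\log(n!)) and O(n \log n) denote the same bound.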

Something I find very interesting is the concept of semi-sorted data.
Semi-sorted data is often good enough, easier to achieve, and quite easy to
then sort completely if that is required. An example of semi-sorted data is
data in an order that satisfies the heap property. Or, for spatial queries
returning the closest hits first but not necessarily in perfect order: say,
returning the hits within a one-mile radius first, before the ones in a radius
between 1-10 miles, and so on, without requiring the hits in each 'segment'
to be perfectly ordered by distance. Breadth-first order is another example
of semi-sorted data that could be used when traversing data as you've
outlined with paging nodes, or similarly grouped by parent node-order.
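
A small, plain-Java illustration of the heap-property example: building a
binary heap over n items is O(n), so semi-sorted delivery is nearly free,
while fully sorting just the first k hits on demand costs only O(k log n).

import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

public class SemiSorted
{
    public static void main( String[] args )
    {
        List<Integer> distances = Arrays.asList( 9, 3, 7, 1, 8, 2, 6 );

        // O(n) heap construction: the backing order satisfies the heap
        // property (each parent <= its children) but is not fully sorted.
        PriorityQueue<Integer> heap = new PriorityQueue<Integer>( distances );

        // Completing the sort for the first k hits costs O(k log n).
        for ( int k = 0; k < 3; k++ )
        {
            System.out.println( "next closest: " + heap.poll() );
        }
    }
}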

I must say that I really enjoy following this discussion. I really like the
idea of streaming, since I think that can be implemented more easily than
paging, while satisfying many of the desired use cases. But I still want to
hear more arguments for and against both alternatives. And as has already
been pointed out, they aren't mutually exclusive.

I'll keep listening in on the conversation, but I don't have much more to
add at this point. I have one desire for the structure of the conversation
though: when you quote what someone else has said before you, could you
please include who that person was? It makes going back and reading the full
context easier.

Cheers,
-- 
Tobias Ivarsson tobias.ivars...@neotechnology.com
Hacker, Neo Technology
www.neotechnology.com
Cellphone: +46 706 534857


Re: [Neo4j] REST results pagination

2011-04-22 Thread Jim Webber
Hi Michael,

 Just in case we're not talking about the same kind of streaming --
 when I think streaming, I think streaming uploads, streaming
 downloads, etc.

I'm thinking chunked transfers. That is the server starts sending a response 
and then eventually terminates it when the whole response has been sent to the 
client.

Although it seems a bit rude, the client could simply opt to close the 
connection when it's read enough providing what it has read makes sense. 
Sometimes document fragments can make sense:

<results>
   <node id="1234">
     <property name="planet" value="Earth"/>
   </node>
   <node id="1235">
     <property name="planet" value="Mars"/>
   </node>
<!-- client gets bored here and kills the connection missing out on what would 
have followed -->
   <node id="1236">
     <property name="planet" value="Jupiter"/>
   </node>
   <node id="1237">
     <property name="planet" value="Saturn"/>
   </node>
</results>

In this case we certainly don't have well-formed XML, but some streaming API 
(e.g. StAX) might already have been able to create some local objects on the 
client side as the Earth and Mars nodes came in.

I don't think this is elegant at all, but it might be practical. I've asked 
Mark Nottingham for his view on this since he's pretty sensible about Web 
things.
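
For what it's worth, a sketch of such a client using the standard Java StAX
API (javax.xml.stream): it builds local objects from <node> elements as they
arrive and simply stops once it has seen enough, abandoning the rest of the
never-completed document. The element and attribute names follow the example
above; the "enough" cutoff is arbitrary.

import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class StreamingClient
{
    public static void readSome( InputStream body, int enough ) throws XMLStreamException
    {
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader( body );
        int seen = 0;
        while ( reader.hasNext() && seen < enough )
        {
            // Handle each <node> as it comes in over the wire.
            if ( reader.next() == XMLStreamConstants.START_ELEMENT
                    && "node".equals( reader.getLocalName() ) )
            {
                System.out.println( "got node " + reader.getAttributeValue( null, "id" ) );
                seen++;
            }
        }
        reader.close(); // the "rude" part: abandon the stream mid-document
    }
}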

Jim






Re: [Neo4j] REST results pagination

2011-04-22 Thread Jim Webber
Hi Georg,

It would at least have to be an iterator over pages - otherwise the results 
tend to be fine-grained and so horribly inefficient for sending over a network.

Jim

On 22 Apr 2011, at 18:24, Georg Summer wrote:

 I might be a little newbish here, but then why not an Iterator?
 The iterator lives on the server and is accessible through the REST
 interface, providing an advance and a value method. It either operates on a
 stored and once-created-stable result set, or holds the query and evaluates
 it on demand (issues of a changing underlying graph included).
 
 The client can have paginator functionality by advancing and dereferencing the
 iterator n times, or streaming-like behaviour by constantly pushing the
 obtained data into a queue and keeping going.
 
 If the client does not need the iterator anymore, he simply stops using it
 and a timeout kills it eventually on the server. A client-callable delete
 method for the iterator would work as well.
 
 
 Georg


Re: [Neo4j] REST results pagination

2011-04-22 Thread Rick Bullotta
SAX (or StAX) is an example of streaming with a higher-level format, but there 
are plenty of other ways as well.  The *critical* performance element is to 
*never* have to accumulate an entire intermediate document on either side (e.g. 
a JSON object or XML DOM) if you can avoid it.  You end up requiring 4x the 
resources (or more), extra latency, more parsing, and more garbage collection.

I'll get with Jim Webber and propose a prototype of alternatives.

Note also that the lack of binary I/O in the browser without 
flash/java/silverlight is a challenge, but we can work around it.




- Reply message -
From: Michael DeHaan michael.deh...@gmail.com
Date: Fri, Apr 22, 2011 12:18 pm
Subject: [Neo4j] REST results pagination
To: Neo4j user discussions user@lists.neo4j.org

On Thu, Apr 21, 2011 at 5:00 PM, Michael Hunger
michael.hun...@neotechnology.com wrote:
 Really cool discussion so far,

 I would also prefer streaming over paging as with that approach we can give 
 both ends more of the control they need.

Just in case we're not talking about the same kind of streaming --
when I think streaming, I think streaming uploads, streaming
downloads, etc.

If the REST format is JSON (or XML, whatever), that's a /document/, so
you can't just say "read the next (up to) 512 bytes" and work on it.
It becomes a more low-level endeavor because if you're in the middle
of reading a record, or don't even have the end-of-list terminator,
what you have isn't parseable yet.  I'm sure a lot of hacking could be
done to make the client figure out whether it has enough other than the
closing array element, but it's a lot to ask of a JSON client.

So I'm interested in how, in that proposal, the REST API might stream
results to a client, because for the streaming to be meaningful, you
need to be able to parse what you get back and know where the
boundaries are (or build a buffer until you fill in a datastructure
enough to operate on it).

I don't see that working with JSON/REST so much.   It seems to imply a
message bus.
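
One framing that sidesteps this (purely an illustration, not something this
thread has settled on) is to emit one complete JSON object per line, so the
client can parse record by record without ever holding the whole document.
A sketch, assuming the Jackson library:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class LineDelimitedJsonClient
{
    public static void consume( InputStream body ) throws IOException
    {
        ObjectMapper mapper = new ObjectMapper();
        BufferedReader in = new BufferedReader( new InputStreamReader( body, "UTF-8" ) );
        String line;
        while ( ( line = in.readLine() ) != null )
        {
            if ( line.isEmpty() ) continue;            // skip keep-alive blanks
            JsonNode record = mapper.readTree( line ); // each line is complete JSON
            System.out.println( record.get( "id" ) );  // operate on it immediately
        }
    }
}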

--Michael


Re: [Neo4j] REST results pagination

2011-04-22 Thread Rick Bullotta
That would need to hold resources on the server (potentially for an 
indeterminate amount of time) since it must be stateful.  In general, stateful 
APIs do not scale well in cases of dynamic queries.






- Reply message -
From: Georg Summer georg.sum...@gmail.com
Date: Fri, Apr 22, 2011 1:25 pm
Subject: [Neo4j] REST results pagination
To: Neo4j user discussions user@lists.neo4j.org

I might be a little newbish here, but then why not an Iterator?
The iterator lives on the server and is accessible through the REST
interface, providing an advance and a value method. It either operates on a
stored and once-created-stable result set, or holds the query and evaluates
it on demand (issues of a changing underlying graph included).

The client can have paginator functionality by advancing and dereferencing the
iterator n times, or streaming-like behaviour by constantly pushing the
obtained data into a queue and keeping going.

If the client does not need the iterator anymore, he simply stops using it
and a timeout kills it eventually on the server. A client-callable delete
method for the iterator would work as well.


Georg

On 22 April 2011 18:43, Jim Webber j...@neotechnology.com wrote:

 Hi Michael,

  Just in case we're not talking about the same kind of streaming --
  when I think streaming, I think streaming uploads, streaming
  downloads, etc.

 I'm thinking chunked transfers. That is the server starts sending a
 response and then eventually terminates it when the whole response has been
 sent to the client.

 Although it seems a bit rude, the client could simply opt to close the
 connection when it's read enough providing what it has read makes sense.
 Sometimes document fragments can make sense:

 <results>
   <node id="1234">
     <property name="planet" value="Earth"/>
   </node>
   <node id="1235">
     <property name="planet" value="Mars"/>
   </node>
 <!-- client gets bored here and kills the connection missing out on what
 would have followed -->
   <node id="1236">
     <property name="planet" value="Jupiter"/>
   </node>
   <node id="1237">
     <property name="planet" value="Saturn"/>
   </node>
 </results>

 In this case we certainly don't have well-formed XML, but some streaming
 API (e.g. StAX) might already have been able to create some local objects on
 the client side as the Earth and Mars nodes came in.

 I don't think this is elegant at all, but it might be practical. I've asked
 Mark Nottingham for his view on this since he's pretty sensible about Web
 things.

 Jim






Re: [Neo4j] REST results pagination

2011-04-22 Thread Rick Bullotta
I'll be happy to host the streaming REST API summit.  Ample amounts of beer 
will be provided. ;-)


- Reply message -
From: Jim Webber j...@neotechnology.com
Date: Fri, Apr 22, 2011 1:46 pm
Subject: [Neo4j] REST results pagination
To: Neo4j user discussions user@lists.neo4j.org

Hi Georg,

It would at least have to be an iterator over pages - otherwise the results 
tend to be fine-grained and so horribly inefficient for sending over a network.

Jim

On 22 Apr 2011, at 18:24, Georg Summer wrote:

 I might be a little newbish here, but then why not an Iterator?
 The iterator lives on the server and is accessible through the REST
 interface, providing an advance and a value method. It either operates on a
 stored and once-created-stable result set, or holds the query and evaluates
 it on demand (issues of a changing underlying graph included).

 The client can have paginator functionality by advancing and dereferencing the
 iterator n times, or streaming-like behaviour by constantly pushing the
 obtained data into a queue and keeping going.

 If the client does not need the iterator anymore, he simply stops using it
 and a timeout kills it eventually on the server. A client-callable delete
 method for the iterator would work as well.


 Georg


Re: [Neo4j] REST results pagination

2011-04-22 Thread Michael Hunger
And you would want to reuse your connection so you don't have to pay this 
penalty per request.

Just asking: how would such a REST resource iterator look - URI, verbs, 
request/response formats?

I assume then every query (index, traversal) would just return the iterator URI 
for later consumption. If we store the query and/or result information (as 
discussed by Craig and others) at the node returned as the iterator, this would be 
a nice fit.
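
To make the question concrete, one hypothetical shape for such an iterator
resource (the URIs, parameters and field names below are invented for
illustration, not an agreed design):

POST /db/data/node/42/traverse         create an iterator resource for a query
  -> 201 Created, Location: /db/data/iterator/7f3a

GET /db/data/iterator/7f3a?take=100    advance: the next (up to) 100 results
  -> 200 OK, { "results" : [ ... ], "next" : "/db/data/iterator/7f3a?take=100" }

DELETE /db/data/iterator/7f3a          client-callable delete; otherwise an
  -> 204 No Content                    idle timeout reclaims it on the server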

M
Sent from my iBrick4


On 22.04.2011 at 19:46, Jim Webber j...@neotechnology.com wrote:

 Hi Georg,
 
 It would at least have to be an iterator over pages - otherwise the results 
 tend to be fine-grained and so horribly inefficient for sending over a 
 network.
 
 Jim
 
 On 22 Apr 2011, at 18:24, Georg Summer wrote:
 
 I might be a little newbish here, but then why not an Iterator?
 The iterator lives on the server and is accessible through the REST
 interface, providing an advance and a value method. It either operates on a
 stored and once-created-stable result set, or holds the query and evaluates
 it on demand (issues of a changing underlying graph included).
 
 The client can have paginator functionality by advancing and dereferencing the
 iterator n times, or streaming-like behaviour by constantly pushing the
 obtained data into a queue and keeping going.
 
 If the client does not need the iterator anymore, he simply stops using it
 and a timeout kills it eventually on the server. A client-callable delete
 method for the iterator would work as well.
 
 
 Georg


Re: [Neo4j] REST results pagination

2011-04-22 Thread Michael Hunger
I spent some time looking at what others are doing for inspiration.

I kind of like the Riak/Basho approach with multipart chunks and the approach 
of explicitly creating a resource for the query that can be navigated (either 
via pages or first,next,[prev,last] links) and expires (and could be 
reconstructed).

Cheers

Michael

Good discussion: 
http://stackoverflow.com/questions/924472/paging-in-a-rest-collection

CouchDB: 
http://wiki.apache.org/couchdb/HTTP_Document_API
startKey + limit, endKey + limit, sorting, insert/update order

MongoDB: [cursor-id]+batch_size


OrientDB: .../[limit]

Sones: no real REST API, but SQL on top of the graph: 
http://developers.sones.de/documentation/graph-query-language/select/
with limit, offset, but also depth (for graph)

HBase explicitly creates scanners, which can then be accessed with next 
operations, and which expire after a period of inactivity


riak:
http://wiki.basho.com/REST-API.html
 client-id header for client identification - sticky?
optional query parameters for including properties, and whether to stream the data: 
keys=[true,false,stream]

If “keys=stream”, the response will be transferred using chunked-encoding, 
where each chunk is a JSON object. The first chunk will contain the “props” 
entry (if props was not set to false). Subsequent chunks will contain 
individual JSON objects with the “keys” entry containing a sublist of the total 
keyset (some sublists may be empty).
Riak seems to support partial JSON (non-closed elements): -d 
'{"props":{"n_val":5'

returns multiple responses in one go, Content-Type: multipart/mixed; 
boundary=YinLMzyUR9feB17okMytgKsylvh

--YinLMzyUR9feB17okMytgKsylvh
Content-Type: application/x-www-form-urlencoded
Link: </riak/test>; rel="up"
Etag: 16vic4eU9ny46o4KPiDz1f
Last-Modified: Wed, 10 Mar 2010 18:01:06 GMT

{"bar":"baz"}
(this block can be repeated n times)
--YinLMzyUR9feB17okMytgKsylvh--
* Connection #0 to host 127.0.0.1 left intact
* Closing connection #0

Query results:
Content-Type – always multipart/mixed, with a boundary specified
Understanding the response body

The response body will always be multipart/mixed, with each chunk representing 
a single phase of the link-walking query. Each phase will also be encoded in 
multipart/mixed, with each chunk representing a single object that was found. 
If no objects were found or “keep” was not set on the phase, no chunks will be 
present in that phase. Objects inside phase results will include Location 
headers that can be used to determine bucket and key. In fact, you can treat 
each object-chunk similarly to a complete response from read object, without 
the status code.
 HTTP/1.1 200 OK
 Server: MochiWeb/1.1 WebMachine/1.6 (eat around the stinger)
 Expires: Wed, 10 Mar 2010 20:24:49 GMT
 Date: Wed, 10 Mar 2010 20:14:49 GMT
 Content-Type: multipart/mixed; boundary=JZi8W8pB0Z3nO3odw11GUB4LQCN
 Content-Length: 970


--JZi8W8pB0Z3nO3odw11GUB4LQCN
Content-Type: multipart/mixed; boundary=OjZ8Km9J5vbsmxtcn1p48J91cJP

--OjZ8Km9J5vbsmxtcn1p48J91cJP
Content-Type: application/json
Etag: 3pvmY35coyWPxh8mh4uBQC
Last-Modified: Wed, 10 Mar 2010 20:14:13 GMT

{"riak":"CAP"}
--OjZ8Km9J5vbsmxtcn1p48J91cJP--

--JZi8W8pB0Z3nO3odw11GUB4LQCN
Content-Type: multipart/mixed; boundary=RJKFlAs9PrdBNfd74HANycvbA8C

--RJKFlAs9PrdBNfd74HANycvbA8C
Location: /riak/test/doc2
Content-Type: application/json
Etag: 6dQBm9oYA1mxRSH0e96l5W
Last-Modified: Wed, 10 Mar 2010 18:11:41 GMT

{"foo":"bar"}
--RJKFlAs9PrdBNfd74HANycvbA8C--

--JZi8W8pB0Z3nO3odw11GUB4LQCN--
* Connection #0 to host 127.0.0.1 left intact
* Closing connection #0

Riak - MapReduce:
Optional query parameters:

* chunked – when set to true, results will be returned one at a time in 
multipart/mixed format using chunked-encoding.
Important headers:

* Content-Type – application/json when chunked is not true, 
otherwise multipart/mixed with application/json parts

Other interesting endpoints: /ping, /stats
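
For reference, the client-side framing for such a multipart/mixed stream can
be quite small; a rough sketch (the boundary value comes from the
Content-Type header, and a real client would also parse each part's headers):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

public class MultipartReader
{
    public static void read( InputStream body, String boundary ) throws IOException
    {
        BufferedReader in = new BufferedReader( new InputStreamReader( body, "UTF-8" ) );
        StringBuilder part = new StringBuilder();
        String line;
        while ( ( line = in.readLine() ) != null )
        {
            // "--boundary" ends the current part; "--boundary--" ends the stream.
            if ( line.equals( "--" + boundary ) || line.equals( "--" + boundary + "--" ) )
            {
                if ( part.length() > 0 )
                {
                    System.out.println( "part:\n" + part ); // process one chunk
                    part.setLength( 0 );
                }
            }
            else
            {
                part.append( line ).append( '\n' );
            }
        }
    }
}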


Re: [Neo4j] REST results pagination

2011-04-21 Thread Georg Summer
Legacy application that just uses a new data source. It can be quite hard to
get users away from their trusty old-chap UI.  In the case of pagination,
legacy might only mean a few years, but it is still legacy :-).

@1-2) In the wake of mobile applications and mobile sites, a pagination
system might be more relevant than bulk loading everything and displaying
it. Defining smart filters might be problematic in such a use case as well.

Parallelism of an application could also be an interesting aspect. Each
worker retrieves a different page of the graph, and the user does not have
to care at all about separating the graph after downloading it. This would
only be interesting, though, if the graph relations are not important.

Georg

On 21 April 2011 14:59, Rick Bullotta rick.bullo...@thingworx.com wrote:

 Fwiw, I think paging is an outdated crutch, for a few reasons:

 1) bandwidth and browser processing/parsing are largely non issues,
 although they used to be

 2) human users rarely have the patience (and usability sucks) to go beyond
 2-4 pages of information.  It is far better to allow incrementally refined
 filters and searches to get to a workable subset of data.

 3) machine users couldn't care less about paging

 4) when doing visualization of a large dataset, you generally want the
 whole dataset, not a page of it, so that's another non use case

 Discuss and debate please!

 Rick



 - Reply message -
 From: Craig Taverner cr...@amanzi.com
 Date: Thu, Apr 21, 2011 8:52 am
 Subject: [Neo4j] REST results pagination
 To: Neo4j user discussions user@lists.neo4j.org

 
  I assume this:
 Traverser x = Traversal.description().traverse( someNode );
 x.nodes();
 x.nodes(); // Not necessarily in the same order as previous call.
 
  If that assumption is false or there is some workaround, then I agree
 that
  this is a valid approach, and a good efficient alternative when sorting
 is
  not relevant. Glancing at the code in TraverserImpl though, it really
 looks
  like the call to .nodes  will re-run the traversal, and I thought that
  would
  mean the two calls can yield results in different order?
 

 OK. My assumptions were different. I assume that while the order is not
 easily predictable, it is reproducible as long as the underlying graph has
 not changed. If the graph changes, then the order can change also. But I
 think this is true of a relational database also, is it not?

 So, obviously pagination is expected (by me at least) to give page X as it
 is at the time of the request for page X, not at the time of the request
 for
 page 1.

 But my assumptions could be incorrect too...

 I understand, and completely agree. My problem with the approach is that I
  think it's harder than it looks at first glance.
 

 I guess I cannot argue that point. My original email said I did not know if
 this idea had been solved yet. Since some of the key people involved in
 this
 have not chipped into this discussion, either we are reasonably correct in
 our ideas, or so wrong that they don't know where to begin correcting us
 ;-)

 This is what makes me push for the sorted approach - relational databases
  are doing this. I don't know how they do it, but they are, and we should
 be
  at least as good.
 

 Absolutely. We should be as good. Relational databases manage to serve a
 page
 deep down the list quite fast. I must believe if they had to complete the
 traversal, sort the results and extract the page on every single page
 request, they could not be so fast. I think my ideas for the traversal are
 'supposed' to be performance enhancements, and that is why I like them ;-)

 I agree the issue of what should be indexed to optimize sorting is a
  domain-specific problem, but I think that is how relational databases
 treat
  it as well. If you want sorting to be fast, you have to tell them to
 index
  the field you will be sorting on. The only difference contra having the
  user
  put the sorting index in the graph is that relational databases will
 handle
  the indexing for you, saving you a *ton* of work, and I think we should
  too.
 

 Yes. I was discussing automatic indexing with Mattias recently. I think
 (and
 hope I am right), that once we move to automatic indexes, then it will be
 possible to put external indexes (à la Lucene) and graph indexes (like the
 ones I favour) behind the same API. In this case perhaps the database will
 more easily be able to make the right optimized decisions, and use the
 index
 for providing sorted results fast and with low memory footprint where
 possible, based on the existence or non-existence of the necessary indices.
 Then all the developer needs to do to make things really fast is put in the
 right index. For some data, that would be lucene and for others it would be
 a graph index. If we get to this point, I think we will have closed a key
 usability gap with relational databases.

 There are cases where you need to add this sort of meta data to your domain
  model

Re: [Neo4j] REST results pagination

2011-04-21 Thread Michael DeHaan

  3) machine users couldn't care less about paging

My thoughts are that parsing very large documents can perform poorly
and requires the entire document to be slurped into (available) RAM.
This puts a cap on the size of a usable resultset and slows
processing, or at least makes you pay an up-front cost, and decreases
the potential for parallelism in other parts of your app.


Re: [Neo4j] REST results pagination

2011-04-21 Thread Rick Bullotta
Fwiw, we use an "idiot resistant" (no such thing as "idiot proof") approach 
that clamps the number of returned items on the server side by default. We 
allow the user to explicitly request to do something foolish and ask for more 
data, but it requires a conscious effort.
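
A sketch of that clamp (the limits and the override flag are invented for
illustration):

public class Limits
{
    static final int DEFAULT_LIMIT = 1000;  // cap applied by default
    static final int HARD_LIMIT = 100000;   // even a conscious effort stops here

    // The default cap applies unless the client explicitly opts in to more,
    // and even then a hard ceiling remains.
    static int effectiveLimit( Integer requested, boolean explicitlyUnbounded )
    {
        if ( requested == null )
        {
            return DEFAULT_LIMIT;
        }
        return Math.min( requested, explicitlyUnbounded ? HARD_LIMIT : DEFAULT_LIMIT );
    }
}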


- Reply message -
From: Jacob Hansson ja...@voltvoodoo.com
Date: Thu, Apr 21, 2011 10:06 am
Subject: [Neo4j] REST results pagination
To: Neo4j user discussions user@lists.neo4j.org

On Thu, Apr 21, 2011 at 2:59 PM, Rick Bullotta
rick.bullo...@thingworx.com wrote:

 Fwiw, I think paging is an outdated crutch, for a few reasons:

 1) bandwidth and browser processing/parsing are largely non issues,
 although they used to be


I disagree. They have improved significantly, for sure, but that is no
reason to download massive amounts of data that will never be used.



 2) human users rarely have the patience (and usability sucks) to go beyond
 2-4 pages of information.  It is far better to allow incrementally refined
 filters and searches to get to a workable subset of data.


I agree with the suckiness of paging and the awesomeness of filtering - but
what do you do when the user's filter returns 40 million results? You somehow
have to tell the user that "damn, that filter returned forty freaking
million results, you need to refine your search, buddy".

The way the user expects that to happen is through presenting a paged,
infinite-scrolled or similar interface, where she can see how many results
were returned and act on that feedback.


 3) machine users couldn't care less about paging


Agreed, streaming is a much better way for machines to talk about data that
doesn't fit in memory.


 4) when doing visualization of a large dataset, you generally want the
 whole dataset, not a page of it, so that's another non use case


Not necessarily true. You need all the data that you want to visualize, but
that is not necessarily all the data the user has asked for. You can be
clever about the visualization to keep it uncluttered, and paging-like
behaviours may be a way to do that.



 Discuss and debate please!

 Rick



 - Reply message -
 From: Craig Taverner cr...@amanzi.com
 Date: Thu, Apr 21, 2011 8:52 am
 Subject: [Neo4j] REST results pagination
 To: Neo4j user discussions user@lists.neo4j.org

 
  I assume this:
 Traverser x = Traversal.description().traverse( someNode );
 x.nodes();
 x.nodes(); // Not necessarily in the same order as previous call.
 
  If that assumption is false or there is some workaround, then I agree
 that
  this is a valid approach, and a good efficient alternative when sorting
 is
  not relevant. Glancing at the code in TraverserImpl though, it really
 looks
  like the call to .nodes  will re-run the traversal, and I thought that
  would
  mean the two calls can yield results in different order?
 

 OK. My assumptions were different. I assume that while the order is not
 easily predictable, it is reproducible as long as the underlying graph has
 not changed. If the graph changes, then the order can change also. But I
 think this is true of a relational database also, is it not?

 So, obviously pagination is expected (by me at least) to give page X as it
 is at the time of the request for page X, not at the time of the request
 for
 page 1.

 But my assumptions could be incorrect too...

 I understand, and completely agree. My problem with the approach is that I
  think it's harder than it looks at first glance.
 

 I guess I cannot argue that point. My original email said I did not know if
 this idea had been solved yet. Since some of the key people involved in
 this
 have not chipped into this discussion, either we are reasonably correct in
 our ideas, or so wrong that they don't know where to begin correcting us
 ;-)

 This is what makes me push for the sorted approach - relational databases
  are doing this. I don't know how they do it, but they are, and we should
 be
  at least as good.
 

 Absolutely. We should be as good. Relational databases manage to serve a
 page
 deep down the list quite fast. I must believe if they had to complete the
 traversal, sort the results and extract the page on every single page
 request, they could not be so fast. I think my ideas for the traversal are
 'supposed' to be performance enhancements, and that is why I like them ;-)

 I agree the issue of what should be indexed to optimize sorting is a
  domain-specific problem, but I think that is how relational databases
 treat
  it as well. If you want sorting to be fast, you have to tell them to
 index
  the field you will be sorting on. The only difference contra having the
  user
  put the sorting index in the graph is that relational databases will
 handle
  the indexing for you, saving you a *ton* of work, and I think we should
  too.
 

 Yes. I was discussing automatic indexing with Mattias recently. I think
 (and
 hope I am right), that once we move to automatic indexes, then it will be
 possible to put

Re: [Neo4j] REST results pagination

2011-04-21 Thread Rick Bullotta
Good dialog, btw!

- Reply message -
From: Jacob Hansson ja...@voltvoodoo.com
Date: Thu, Apr 21, 2011 10:06 am
Subject: [Neo4j] REST results pagination
To: Neo4j user discussions user@lists.neo4j.org

On Thu, Apr 21, 2011 at 2:59 PM, Rick Bullotta
rick.bullo...@thingworx.com wrote:

 Fwiw, I think paging is an outdated crutch, for a few reasons:

 1) bandwidth and browser processing/parsing are largely non issues,
 although they used to be


I disagree. They have improved significantly, for sure, but that is no
reason to download massive amounts of data that will never be used.



 2) human users rarely have the patience (and usability sucks) to go beyond
 2-4 pages of information.  It is far better to allow incrementally refined
 filters and searches to get to a workable subset of data.


I agree with the suckiness of paging and the awesomeness of filtering - but
what do you do when the user's filter returns 40 million results? You somehow
have to tell the user that "damn, that filter returned forty freaking
million results, you need to refine your search, buddy".

The way the user expects that to happen is through presenting a paged,
infinite-scrolled or similar interface, where she can see how many results
were returned and act on that feedback.


 3) machine users couldn't care less about paging


Agreed, streaming is a much better way for machines to talk about data that
doesn't fit in memory.


 4) when doing visualization of a large dataset, you generally want the
 whole dataset, not a page of it, so that's another non use case


Not necessarily true. You need all the data that you want to visualize, but
that is not necessarily all the data the user has asked for. You can be
clever about the visualization to keep it uncluttered, and paging-like
behaviours may be a way to do that.



 Discuss and debate please!

 Rick



 - Reply message -
 From: Craig Taverner cr...@amanzi.com
 Date: Thu, Apr 21, 2011 8:52 am
 Subject: [Neo4j] REST results pagination
 To: Neo4j user discussions user@lists.neo4j.org

 
  I assume this:
 Traverser x = Traversal.description().traverse( someNode );
 x.nodes();
 x.nodes(); // Not necessarily in the same order as previous call.
 
  If that assumption is false or there is some workaround, then I agree
 that
  this is a valid approach, and a good efficient alternative when sorting
 is
  not relevant. Glancing at the code in TraverserImpl though, it really
 looks
  like the call to .nodes  will re-run the traversal, and I thought that
  would
  mean the two calls can yield results in different order?
 

 OK. My assumptions were different. I assume that while the order is not
 easily predictable, it is reproducible as long as the underlying graph has
 not changed. If the graph changes, then the order can change also. But I
 think this is true of a relational database also, is it not?

 So, obviously pagination is expected (by me at least) to give page X as it
 is at the time of the request for page X, not at the time of the request
 for
 page 1.

 But my assumptions could be incorrect too...

 I understand, and completely agree. My problem with the approach is that I
  think it's harder than it looks at first glance.
 

 I guess I cannot argue that point. My original email said I did not know if
 this idea had been solved yet. Since some of the key people involved in
 this
 have not chipped into this discussion, either we are reasonably correct in
 our ideas, or so wrong that they don't know where to begin correcting us
 ;-)

 This is what makes me push for the sorted approach - relational databases
  are doing this. I don't know how they do it, but they are, and we should
 be
  at least as good.
 

 Absolutely. We should be as good. Relational databases manage to serve a
 page
 deep down the list quite fast. I must believe if they had to complete the
 traversal, sort the results and extract the page on every single page
 request, they could not be so fast. I think my ideas for the traversal are
 'supposed' to be performance enhancements, and that is why I like them ;-)

 I agree the issue of what should be indexed to optimize sorting is a
  domain-specific problem, but I think that is how relational databases
 treat
  it as well. If you want sorting to be fast, you have to tell them to
 index
  the field you will be sorting on. The only difference contra having the
  user
  put the sorting index in the graph is that relational databases will
 handle
  the indexing for you, saving you a *ton* of work, and I think we should
  too.
 

 Yes. I was discussing automatic indexing with Mattias recently. I think
 (and
 hope I am right), that once we move to automatic indexes, then it will be
 possible to put external indexes (à la Lucene) and graph indexes (like the
 ones I favour) behind the same API. In this case perhaps the database will
 more easily be able to make the right optimized decisions, and use the
 index
 for providing sorted results fast

Re: [Neo4j] REST results pagination

2011-04-21 Thread Jim Webber
This is indeed a good dialogue. Pagination versus streaming were issues I'd 
previously thought of as orthogonal, but I like the direction this is going. 
Let's break it down to fundamentals:

As a remote client, I want to be just as rich and performant as a local client. 
Unfortunately,  Deutsch, Amdahl and Einstein are against me on that, and I 
don't think I am tough enough to defeat those guys.

So what are my choices? I know I have to be more granular to try to alleviate 
some of the network penalty so doing operations bulkily sounds great. 

Now what I need to decide is whether I control the rate at which those bulk 
operations occur or whether the server does. If I want to control those 
operations, then paging seems sensible. Otherwise a streamed (chunked) encoding 
scheme would make sense if I'm happy for the server to throw results back at me 
at its own pace. Or indeed you can mix both so that pages are streamed.

In either case if I get bored of those results, I'll stop paging or I'll 
terminate the connection.

So what does this mean for implementation on the server? I guess this is 
important since it affects the likelihood of the Neo Tech team implementing it.

If the server supports pagination, it means we need a paging controller in 
memory per paginated result set being created. If we assume that we'll only go 
forward in pages, that's effectively just a wrapper around the traversal that's 
been uploaded. The overhead should be modest, and apart from the paging 
controller and the traverser, it doesn't need much state. We would need to add 
some logic to the representation code to support next links, but that seems a 
modest task.
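
A sketch of that paging controller: a thin, forward-only wrapper around the
lazy traversal's iterator, whose only state is its position in the traversal
(the names are invented):

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class PagingController<T>
{
    private final Iterator<T> traversal;
    private final int pageSize;

    public PagingController( Iterator<T> traversal, int pageSize )
    {
        this.traversal = traversal;
        this.pageSize = pageSize;
    }

    public List<T> nextPage()
    {
        List<T> page = new ArrayList<T>( pageSize );
        while ( page.size() < pageSize && traversal.hasNext() )
        {
            page.add( traversal.next() );
        }
        return page;
    }

    public boolean hasMore()
    {
        return traversal.hasNext(); // render a "next" link only if true
    }
}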

If the server streams, we will need to decouple the representation generation 
from the existing representation logic since that builds an in-memory 
representation which is then flushed. Instead we'll need a streaming 
representation implementation which seems to be a reasonable amount of 
engineering. We'll also need a new streaming binding to the REST server in 
JAX-RS land.
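
On the JAX-RS side, the standard javax.ws.rs.core.StreamingOutput binding
already has roughly the right shape; a sketch, with the serialization of
each result left as a stand-in:

import java.io.IOException;
import java.io.OutputStream;
import java.util.Iterator;
import javax.ws.rs.core.StreamingOutput;

public class StreamingRepresentation
{
    // Writes each result to the response as the lazy traversal produces it,
    // so no in-memory representation is ever accumulated.
    public static StreamingOutput of( final Iterator<String> results )
    {
        return new StreamingOutput()
        {
            public void write( OutputStream out ) throws IOException
            {
                while ( results.hasNext() )
                {
                    out.write( results.next().getBytes( "UTF-8" ) ); // stand-in codec
                    out.write( '\n' );
                    out.flush(); // flush incrementally rather than at the end
                }
            }
        };
    }
}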

I'm still a bit concerned about how rude it is for a client to just drop a 
streaming connection. I've asked Mark Nottingham for his authoritative opinion 
on that. But still, this does seem popular and feasible.

Jim







Re: [Neo4j] REST results pagination

2011-04-21 Thread Craig Taverner

 I can only think of a few use cases where losing some of the expected
 result is ok, for instance if you want to peek at the result.


IMHO, paging is, by definition, a peek. Since the client controls when the
next page will be requested, it is not possible, or reasonable, to enforce
that the complete set of pages (if ever requested) will represent a
consistent result set. This is not supported by relational databases either.
The result set, and the meaning of a page, can change between requests. So it
can, and does, happen that some of the expected result is lost.

This is completely different to the streaming result, which I see Jim
commented on, and so I might just reply to his mail too :-)

I'm waiting for one of those SlapOnTheFingersExceptions that Tobias has
 been handing out :)


My fingers are, as yet, unscathed. The slap can come at any moment! :-)

This sounds really cool, would be a great thing to look into!


Should you want examples, I have a wiki page on this topic at
http://redmine.amanzi.org/wiki/geoptima/Geoptima_Event_Log



Re: [Neo4j] REST results pagination

2011-04-21 Thread Craig Taverner
I think Jim makes a great point about the differences between paging and
streaming, being client or server controlled. I think there is a related
point to be made, and that is that paging does not, and cannot, guarantee a
consistent total result set. Since the database can change between page
requests, the pages can be inconsistent. It is possible for the same record to
appear in two pages, or for a record to be missed. This is certainly how
relational databases work in this regard.

But in the streaming case, we expect a complete and consistent result set.
Unless, of course, the client cuts off the stream. The use cases are very
different: while paging is about getting a peek at the data, and rarely
about paging all the way to the end, streaming is about getting the entire
result, streamed for efficiency.



Re: [Neo4j] REST results pagination

2011-04-21 Thread Rick Bullotta
Jim, we should schedule a group chat on this topic.



- Reply message -
From: Jim Webber j...@neotechnology.com
Date: Thu, Apr 21, 2011 11:01 am
Subject: [Neo4j] REST results pagination
To: Neo4j user discussions user@lists.neo4j.org



Re: [Neo4j] REST results pagination

2011-04-21 Thread Michael Hunger
Really cool discussion so far,

I would also prefer streaming over paging as with that approach we can give 
both ends more of the control they need.

The server doesn't have to keep state over a long time (and also implement 
timeouts and clearing of that state, and keeping that state for lots of clients 
also adds up).
The client can decide how much of the result it's interested in, whether that is just 
1 entry or 100k, and then just drop the connection.
Streaming calls can also have a request-timeout, so keeping those open for too 
long (with no activity) will close them automatically.
The server doesn't use up lots of memory for streaming; one could even leverage the 
laziness of traversers (and indexes) to avoid executing/fetching results 
that are not going to be sent over the wire.

This should accommodate every kind of client from the mobile phone which only 
lists a few entries, to the big machine that can eat a firehose of result data 
in milliseconds.

For this kind of look-ahead support we could (and should) add an optional 
offset, so that a client can request data (whose order _it_ is sure hasn't 
changed) by having the server skip the first n entries (so they don't have 
to be serialized/put on the wire).
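
With a lazy traverser, that skip can be as cheap as advancing the iterator 
without serializing anything; a minimal sketch:

    import itertools

    def offset_page(traversal, offset, limit):
        # Skip the first `offset` results without putting them on the wire,
        # then hand back at most `limit` of the rest, still lazily.
        return itertools.islice(traversal, offset, offset + limit)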

I also think that this streaming API could already address many of the 
pain-points of the current REST API. Perhaps we even want to provide a 
streaming interface in both directions, with the client able to, for 
instance, stream the creation of nodes and relationships and their indexing 
without restarting a connection for each operation. Whatever comes in this 
stream could also be processed in one TX (or, with TX tokens embedded in the 
stream, the client could even control that).

The only question this poses for me is whether we want to put it on top of 
the existing REST API or rather create a more concise API/format for that 
(with the later option of the format even degrading to binary for 
high-bandwidth interaction). I'd prefer the latter.

Cheers

Michael


Re: [Neo4j] REST results pagination

2011-04-21 Thread Rick Otten
Half-baked thoughts from a neo4j newbie hacker type on this topic:

1)  I think it is very important, even with modern infrastructures, for
the client to be able to optionally throttle the result set it generates
with a query as it sees fit, and not just because of client memory and
bandwidth limitations.

With regular old SQL databases if you send a careless large query, you
can chew up significant system resources, for significant amounts of
time while it is being processed.  At a minimum, a rowcount/pagination
option allows you to build something into your client which can
minimize accidental denial of service queries.   I'm not sure if it is
possible to construct a query against a large Neo4j database that
would temporarily cripple it, but it wouldn't surprise me if you
could.


2) Sometimes with regular old SQL databases I'll run a sanity check
count() function with the query to just return the size of the expected
result set before I try to pull it back into my data structure.  Many
times count() is all I needed anyhow.   Does Neo4j have a result set
size function?  Perhaps a client that really could only handle small
result sets could run a count(), and then filter the search somehow, if
necessary, until the count() was smaller?  (I guess it would depend on the
problem domain...)

   In other words it may be possible, when it is really important, to
implement pagination logic on the client side, if you don't mind
running multiple queries for each set of data you get back.


3)  If the result set was broken into pages, you could organize the pages
in the server with a set of [temporary] graph nodes with relationships to
the results in the database -- one node for each page, and a parent node
for the result set.   If the order of the pages is important, you could add
directed relationships between the page nodes.  If the order within the
pages is important you could either apply a sequence numbering to the
page-result relationship, or add temporary directed result-set
relationships too.

Subsequent page retrievals would be new traversals based on the search
result set graph.  In a sense you would be building a temporary
graph-index I suppose.

An advantage to organizing search result sets this way is that you
could then union and intersect result sets (and do other set
operations) without a huge memory overhead.  (Which means you could
probably store millions of search results at one time, and you could
persist them through restarts.)  A sketch of this idea follows after
this list.



4) In some HA architectures you may have multiple database copies behind a
load balancer.  Would the search result pages be stored equally on all of
them?  Would the client require a sticky flag, to always go back to the
same specific server instance for more pages?

   Depending on how fast writes get propagated across the cluster
(compared to requests for the next page), if you were creating nodes as
described in (3) would that work?



5) As for sorting:

   In my experience, if I need a result set sorted from a regular SQL
database, I will usually sort it myself.  Most databases I've ever
worked with routinely have performance problems.  You can minimize
finger pointing and the risk of compounding those other performance
problems by just directing the database to "get me what I need"; I'll do
the rest of it back in the client.

   On the other hand, sometimes it is quicker and easier to let the
database do the work. (Usually when I can only handle the data in small
chunks on the client.)

   What I'm trying to say is that I think sorting is going to be more
important to clients who want paginated results (i.e., resource-limited
clients) than to clients who are grabbing large chunks of data
at a time (and will want to own any post-query processing steps
anyhow).
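
Picking up point (3) above: a sketch of how such a temporary result-set graph
might be built, against a purely hypothetical embedded API (create_node and
create_relationship_to are illustrative names, not the real Neo4j bindings):

    def persist_result_pages(graph, results, page_size):
        # One parent node for the result set, one node per page, and ordered
        # relationships so both page order and in-page order survive restarts.
        result_set = graph.create_node(kind="RESULT_SET")
        prev_page = None
        for page_no, start in enumerate(range(0, len(results), page_size)):
            page = graph.create_node(kind="RESULT_PAGE", number=page_no)
            result_set.create_relationship_to(page, "HAS_PAGE")
            if prev_page is not None:
                prev_page.create_relationship_to(page, "NEXT_PAGE")  # page order
            for seq, hit in enumerate(results[start:start + page_size]):
                rel = page.create_relationship_to(hit, "CONTAINS")
                rel["sequence"] = seq  # order within the page
            prev_page = page
        return result_set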


-- 
Rick Otten
rot...@windfish.net
O=='=+




Re: [Neo4j] REST results pagination

2011-04-21 Thread Michael Hunger
Rick,

great thoughts.

Good catch, forgot to add the in-graph representation of the results to my 
mail, thanks for adding that part. Temporary (transient) nodes and 
relationships would really rock here, with the advantage that with HA you have 
them distributed to all cluster nodes.
Certainly Craig will have some interesting things to add to this, as these 
probably resemble his in-graph indexes / R-trees.

As traversers are lazy, a count operation is not so easily possible; you could 
run the traversal and discard the results. But then the client could also just 
pull results until it reaches its internal thresholds and then decide to use 
more filtering, or stop pulling and ask the user for more filtering (you can 
always retrieve n+1 and show the user that there are more than n results 
available).
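
The retrieve-n-plus-one trick in miniature (client-side Python, illustrative only):

    import itertools

    def first_n_with_more_flag(results, n):
        # Pull at most n+1 items; the extra one only signals that more exist.
        items = list(itertools.islice(results, n + 1))
        return items[:n], len(items) > n

    # rows, more = first_n_with_more_flag(stream, 20)
    # if more: tell the user "there are more than 20 results"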

The index result size() method only returns an estimate of the result size 
(which might not contain currently changed index entries).

Please don't forget that a count() query in an RDBMS can be as ridiculously 
expensive as the original query (especially if just the column selection was 
replaced with count, and sorting, grouping etc. were still left in place together 
with lots of joins). 

Sorting on your own instead of letting the db do it mostly harms 
performance, as it requires you to build up all the data in memory, sort it and 
then use it. Have the db do it more efficiently instead, stream the data, 
and you can consume it directly from the stream.

Cheers

Michael


Re: [Neo4j] REST results pagination

2011-04-20 Thread Michael Hunger
Hi Javier,

what would you need that for? I'm interested in the use case.

Cheers

Michael

On 20 Apr 2011, at 06:17, Javier de la Rosa wrote:

 If this is finally developed, will it be possible to request all
 nodes and all relationships at some URL?
 
 


Re: [Neo4j] REST results pagination

2011-04-20 Thread Tim McNamara
Data export, e.g. dumping everything as CSV, DOT or RDF?



Re: [Neo4j] REST results pagination

2011-04-20 Thread Akhil
Won't dumping a graph database in a tabular format create a huge file, 
even if the number of nodes is small (for a highly interconnected 
graph)?


Re: [Neo4j] REST results pagination

2011-04-20 Thread Jacob Hansson
On Tue, Apr 19, 2011 at 10:17 PM, Michael DeHaan
michael.deh...@gmail.com wrote:

 On Tue, Apr 19, 2011 at 10:58 AM, Jim Webber j...@neotechnology.com
 wrote:
  I'd like to propose that we put this functionality into the plugin (
 https://github.com/skanjila/gremlin-translation-plugin) that Peter and I
 are currently working on, thoughts?
 
  I'm thinking that, if we do it, it should be handled through content
 negotiation. That is if you ask for application/atom then you get paged
 lists of results. I don't necessarily think that's a plugin, it's more
 likely part of the representation logic in server itself.

 This is something I've been wondering about as I may have the need to
 feed very large graphs into the system and am wondering how the REST
 API will hold up compared to the native interface.

 What happens if the result of an index query (or traversal, whatever)
 legitimately needs to return 100k results?

 Wouldn't that be a bit large for one request?   If anything, it's a
 lot of JSON to decode at once.


Yeah, we can't do this right now, and implementing it is harder than it
seems at first glance, since we first need to implement sorting of results,
otherwise the paged result will be useless. Like Jim said though, this is
another one of those *must be done* features.


 Feeds make sense for things that are feed-like, but do atom feeds
 really make sense for results of very dynamic queries that don't get
 subscribed to?
 Or, related question, is there a point where the result sets of
 operations get so large that things start to break down?   What do
 people find this to generally be?


I'm sure there are some awesome content types out there that we can look at
that will fit our uses. I don't feel confident saying whether Atom is a good
choice; I've never worked with it.

The point where this breaks down I'm gonna guess is in server-side
serialization, because we currently don't stream the serialized data, but
build it up in memory and ship it off when it's done. I'd say you'll run out
of memory after 1 nodes or so on a small server, which I think
underlines how important this is to fix.



 Maybe it's not an issue, but pointers to any problems REST API usage
 has with large data sets (and solutions?) would be welcome.


Not aware of anyone bumping into these limits yet, but I'm sure we'll start
hearing about it. The only current solution I can think of is a server
plugin that emulates this, but it would have to sort the result, and I'm
afraid that it will be hard (probably not impossible, but hard) to implement
that in a memory-efficient way that far away from the kernel. You may just
end up moving the OutOfMemoryExceptions to the plugin instead of the
serialization system.



 --Michael




-- 
Jacob Hansson
Phone: +46 (0) 763503395
Twitter: @jakewins


Re: [Neo4j] REST results pagination

2011-04-20 Thread Javier de la Rosa
Wow, had I known the number of replies, I would have sent the e-mail much earlier ;)

Sorting is a very cool feature. I didn't know the hard-core
implications of pagination. The only thing I want is to avoid the
overhead of sending thousands of nodes through HTTP in JSON. Actually,
a workaround that splits results (using offset and limit) on the
server, without really cutting them off, would be enough for me: sending
only a concrete number of results in JSON. If this feature could be in
the core of Neo4j, perfect :-)


On Wed, Apr 20, 2011 at 08:01, Jacob Hansson ja...@voltvoodoo.com wrote:
 On Wed, Apr 20, 2011 at 11:25 AM, Craig Taverner cr...@amanzi.com wrote:

 I think sorting would need to be optional, since it is likely to be a
 performance and memory hog on large traversals. I think one of the key
 benefits of the traversal framework in the Embedded API is being able to
 traverse and 'stream' a very large graph without occupying much memory. If
 this can be achieved in the REST API (through pagination), that is a very
 good thing. I assume the main challenge is being able to freeze a traverser
 and keep it on hold between client requests for the next page. Perhaps you
 have already solved that bit?


 While I agree with you that the ability to effectively stream the results of
 a traversal is a very useful thing, I don't like the persisted traverser
 approach, for several reasons. I'm sorry if my tone below is a bit harsh, I
 don't mean it that way, I simply want to make a strong case for why I think
 the hard way is the right way in this case.

 First, the only good restful approach I can think of for doing persisted
 traversals would be to create a traversal resource (since it is an object
 that keeps persistent state), and get back an id to refer to it. Subsequent
 calls to paged results would then be to that traversal resource, updating
 its state and getting results back. Assuming this is the correct way to
 implement this, it comes with a lot of questions. Should there be a timeout
 for these resources, or is the user responsible for removing them from
 memory? What happens when the server crashes and the client can't find the
 traversal resources it has ids for?

 If we somehow solve that or find some better approach, we end up with an API
 where a client can get paged results, but two clients performing the same
 traversal on the same data may get back the same result in different order
 (see my comments on sorting based on expected traversal behaviour below).
 This means that the API is really only useful if you actually want to get
 the entire result back. If that was the problem we wanted to solve, a
 streaming solution is a much easier and faster approach than a paging
 solution.

 Second, being able to iterate over the entire result set is only half of the
 use cases we are looking to solve. The other half are the ones I mentioned
 examples of (the blog case, presenting lists of things to users and so on),
 and those are not solved by this. Forcing users of our database to pull out
 all their data over the wire and sort the whole thing, only to keep the
 first 10 items, for each user that lands on their frontpage, is not ok.

 Third, and most importantly to me, using this case to put more pressure on
 ourselves to implement real sorting is a really good thing. Sorting is
 something that *really* should be provided by us, anyone who has used a
 modern database expects this to be our problem to solve. We have a really
 good starting point for optimizing sorting algorithms, sitting as we are
 inside the kernel with our caches and indexes :)



 In my opinion, I would code the sorting as a characteristic of the graph
 itself, in order to avoid having to sort in the server (and incur the
 memory/performance hit). So that means I would use a domain-specific
 solution to sorting. Of course, generic sorting is nice also, but make it
 optional.


 I agree sorting should be an opt-in feature. Putting meta-data like sorting
 order and similar things inside the graph I think is a matter of personal
 preference, and for sure has its place as a useful optimization. I do,
 however, think that the official approach to sorting needs to be based on
 concepts familiar from other databases - define your query, and define how
 you want the result sorted. If indexes are available the database can use
 them to optimize the sorting, otherwise it will suck, but at least we're
 doing what the user wants us to do. All lessons learned in YesSQL databases
 (see what I did there?) should not be unlearned :)

 Also, the approach of sorting via the traversal itself assumes knowledge of
 which order the traverser will move through the graph, and that is not
 necessarily something that will stay the same in later releases. Tobias was
 talking about cache-first traversals as an addition to or even a replacement
 for depth/breadth-first ones, a major optimization we cannot do if we
 encourage people to sort inside the graph.

 /Jake



Re: [Neo4j] REST results pagination

2011-04-20 Thread Javier de la Rosa
Here is my motivation for this petition. In my ideal world, everything
in the Neo4j REST server that returns a list should be something like a
RequestList or QuerySet, and support pagination and even filtering
with lookups, so I could do things like the following (sorry for the Python
syntax, I'm always thinking of the Python REST client):

>>> gdb.nodes.all()[2:5]
# Performs the query on the server to get the nodes between the 2nd and
# 5th position; we assume Neo always returns ordered results in the same
# way.

>>> gdb.nodes.filter(name__contains="neo")[:10]
# Returns only nodes with a property called "name" which contains
# "neo", and of those only the first 10.

This is important for the integration of the Neo4j Python REST Client
in Django, because I'm currently developing an application with lazy,
user-defined schemas on top of Django and Neo4j. The listing of
nodes and relationships is a requirement for me, so pagination is
a must in my application. Performing this in the application layer
instead of on the Neo4j server side wastes a lot of time sending information
via REST.


On Wed, Apr 20, 2011 at 03:43, Michael Hunger
michael.hun...@neotechnology.com wrote:
 But wouldn't that rather custom operation be more easily and much faster 
 done as a server plugin?

 Otherwise all the data would have to be serialized to JSON and deserialized 
 again, with no streaming possible.

 From a server extension you could even stream and gzip that data with ease.

 Cheers

 Michael





-- 
Javier de la Rosa
http://versae.es


Re: [Neo4j] REST results pagination

2011-04-20 Thread Michael DeHaan

 This is important for the integration of the Neo4j Python Rest Client
 in Django, because I'm currently developing an application with lazy
 and user-defined schemas on top of Django and Neo4j. The listing of
 nodes and relationships is a requirement for me, so the pagination is
 a must in my aplication. Performing this in the application layer
 instead of Neo4j server side, wastes a lot of time sending information
 via REST.

Well put about the listing of nodes and relationships.   That's the
use case where this comes up.

If I can't trust that my app's code indexed something correctly, or I
need to index old data later, I may need to walk the whole
graph to update the indexes, so large result sets become scary.   I
don't think I can rely on a traversal, as parts of the graph might be
disjoint.

New use cases on old data mean we'll have to do that, just like adding
a new index to a SQL db.   Or if I have an index that lists all nodes
of a given type, that result set could get very large.

In fact, I probably need to access all nodes in order to apply any new
indexes, if I can't just send a reindexing command that says
"for all nodes, add to the index like so", etc.

If I'm understanding the server plugin thing correctly, I've got to
go write some Java classes to do that... which, while I *can* do, it would be
better if it could be accessed in a language-agnostic way, with
something more or less resembling a database cursor (see MongoDB's
API).

--Michael


Re: [Neo4j] REST results pagination

2011-04-20 Thread Craig Taverner
To respond to your arguments it would be worth noting a comment by Michael
DeHaan later on in this thread. He asked for 'something more or less
resembling a database cursor (see MongoDB's API).' The trick is to achieve
this without having to store a lot of state on the server, so it is robust
against server restarts or crashes.

If we compare to the SQL situation, there are two numbers passed by the
client, the page size and the offset. The state can be re-created by the
database server entirely from this information. How this is implemented in a
relational database I do not know, but whether the database is relational or
a graph, certain behaviors would be expected, like robustness against
database content changes between the requests, and coping with very long
gaps between requests. In my opinion the database cursor could be achieved
by both of the following approaches:

   - Starting the traversal from the beginning, and only returning results
   after passing the cursor offset position
   - Keeping a live traverser in the server, and continuing it from the
   previous position

Personally I think the second approach is simply a performance optimization
of the first. So robustness is achieved by having both, with the second one
working when possible (no server restarts, timeout not expiring, etc.), and
falling back to the first in other cases. This achieves performance and
robustness. What we do not need to do with either case is keep an entire
result set in memory between client requests.
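
A sketch of the two approaches combined (Python stand-in; live_traversers is
an assumed expiring server-side cache and run_traversal re-creates the
traversal from scratch; neither is a real Neo4j API):

    import itertools

    live_traversers = {}  # traversal_id -> (iterator, position); entries may vanish at any time

    def fetch_page(traversal_id, run_traversal, offset, limit):
        cached = live_traversers.get(traversal_id)
        if cached is not None and cached[1] == offset:
            it = cached[0]                                        # optimization: continue the live traverser
        else:
            it = itertools.islice(run_traversal(), offset, None)  # fallback: re-traverse and skip
        page = list(itertools.islice(it, limit))
        live_traversers[traversal_id] = (it, offset + len(page))
        return page

Nothing beyond (offset, limit) is needed to rebuild the cursor, so the
fallback path keeps the scheme robust against restarts.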

Now when you add sorting into the picture, then you need to generate the
complete result-set in memory, sort, paginate and return only the requested
page. If the entire process has to be repeated for every page requested,
this could perform very badly for large result sets. I must believe that
relational databases do not do this (but I do not know how they paginate
sorted results, unless the sort order is maintained in an index).
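
The naive repeat-everything shape being worried about here, for reference
(illustrative only):

    def naive_sorted_page(run_traversal, sort_key, offset, limit):
        # The whole result set is materialized and sorted for *every* page
        # request -- O(n log n) work and O(n) memory each time.
        return sorted(run_traversal(), key=sort_key)[offset:offset + limit]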

To avoid keeping everything in memory, or repeatedly reloading everything to
memory on every page request, we need sorted results to be produced on the
stream. This can be done by keeping the sort order in an index. This is very
hard to do in a generic way, which is why I thought it best done in a domain
specific way.

Finally, I think we are really looking at two, different but valid use
cases. The need for generic sorting combined with pagination, and the need
for pagination on very large result sets. The former use case can work with
re-traversing and sorting on each client request, is fully generic, but will
perform badly on large result sets. The latter can perform adequately on
large result sets, as long as you do not need to sort (and use the database
cursor approach to avoid loading the result set into memory).


Re: [Neo4j] REST results pagination

2011-04-19 Thread Jim Webber
Hi Javier,

I've just checked and that's in our list of stuff we really should do because 
it annoys us that it's not there.

No promises, but we do intend to work through at least some of that list for 
the 1.4 releases.

Jim


Re: [Neo4j] REST results pagination

2011-04-19 Thread Saikat Kanjilal

I'd like to propose that we put this functionality into the plugin 
(https://github.com/skanjila/gremlin-translation-plugin) that Peter and I are 
currently working on, thoughts?


Re: [Neo4j] REST results pagination

2011-04-19 Thread Javier de la Rosa
On Tue, Apr 19, 2011 at 10:32, Saikat Kanjilal sxk1...@hotmail.com wrote:
 I'd like to propose that we put this functionality into the plugin 
 (https://github.com/skanjila/gremlin-translation-plugin) that Peter and I are 
 currently working on, thoughts?

+1

 From: j...@neotechnology.com
 I've just checked and that's in our list of stuff we really should do 
 because it annoys us that it's not there.
 No promises, but we do intend to work through at least some of that list for 
 the 1.4 releases.

It will be great to see the feature in the 1.4 :-)



-- 
Javier de la Rosa
http://versae.es


Re: [Neo4j] REST results pagination

2011-04-19 Thread Jim Webber
 I'd like to propose that we put this functionality into the plugin 
 (https://github.com/skanjila/gremlin-translation-plugin) that Peter and I 
 are currently working on, thoughts?

I'm thinking that, if we do it, it should be handled through content 
negotiation. That is, if you ask for application/atom then you get paged lists 
of results. I don't necessarily think that's a plugin; it's more likely part of 
the representation logic in the server itself.

Jim


Re: [Neo4j] REST results pagination

2011-04-19 Thread Javier de la Rosa
On Tue, Apr 19, 2011 at 10:25, Jim Webber j...@neotechnology.com wrote:
 I've just checked and that's in our list of stuff we really should do 
 because it annoys us that it's not there.
 No promises, but we do intend to work through at least some of that list for 
 the 1.4 releases.

If this is finally developed, will it be possible to request all
nodes and all relationships at some URL?


 Jim



--
Javier de la Rosa
http://versae.es