Half-baked thoughts from a Neo4j newbie hacker type on this topic:

1)  I think it is very important, even with modern infrastructures, for
the client to be able to optionally throttle the result set a query
generates, as it sees fit, and not just because of client memory and
bandwidth limitations.

    With regular old SQL databases, if you send a carelessly large query,
you can chew up significant system resources for significant amounts of
time while it is being processed.  At a minimum, a rowcount/pagination
option lets you build something into your client that can minimize
accidental denial-of-service queries.  I'm not sure if it is possible
to construct a query against a large Neo4j database that would
temporarily cripple it, but it wouldn't surprise me if you could.
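A rough sketch of what I mean by client-side throttling, in Python. The `fetch_page` callable here is a hypothetical stand-in for whatever the real client/driver call would be (e.g. a query with skip/limit-style clauses appended); it is stubbed with an in-memory list so the idea stands on its own:

```python
def make_fetch_page(rows):
    """Stub standing in for a real paged server call."""
    def fetch_page(skip, limit):
        # stands in for something like "... SKIP {skip} LIMIT {limit}"
        return rows[skip:skip + limit]
    return fetch_page

def throttled_fetch(fetch_page, page_size=100, max_rows=1000):
    """Pull results one page at a time, stopping at a hard ceiling,
    so a careless query can never drag back the whole database."""
    out = []
    skip = 0
    while len(out) < max_rows:
        page = fetch_page(skip, page_size)
        if not page:
            break  # server ran out of results before we hit the cap
        out.extend(page[:max_rows - len(out)])
        skip += page_size
    return out
```

The names and page sizes are made up; the point is just that the ceiling lives in the client, so an accidental "select everything" only costs you `max_rows` worth of work.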


2) Sometimes with regular old SQL databases I'll run a sanity-check
"count()" with the query, just to get the size of the expected result
set before I try to pull it back into my data structure.  Many times
"count()" is all I needed anyhow.  Does Neo4j have a result set size
function?  Perhaps a client that really could only handle small result
sets could run a count(), and then filter the search somehow, if
necessary, until the count() was smaller?  (I guess it would depend on
the problem domain...)

   In other words, it may be possible, when it is really important, to
implement pagination logic on the client side, if you don't mind
running multiple queries for each set of data you get back.
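The count-first idea above could look something like this. `run_count` and `run_query` are hypothetical stand-ins for the real client calls (here stubbed with a dict), and the filters are assumed to be ordered loosest to tightest:

```python
def fetch_if_small(run_count, run_query, filters, max_rows):
    """Try successively tighter filters until count() fits the
    client's limit, then (and only then) fetch the actual rows."""
    for f in filters:  # ordered loosest to tightest
        if run_count(f) <= max_rows:
            return run_query(f)
    return None  # no filter was tight enough; caller must decide what to do
```

That is two round trips per successful fetch (one count, one query), which is the price the paragraph above alludes to.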


3)  If the result set were broken into pages, you could organize the
pages in the server with a set of [temporary] graph nodes with
relationships to the results in the database -- one node for each page,
and a parent node for the result set.  If the order of the pages is
important, you could add directed relationships between the page nodes.
If the order within the pages is important, you could either apply a
sequence numbering to the page-result relationship, or add temporary
directed result-set relationships too.

    Subsequent page retrievals would be new traversals based on the search
result set graph.  In a sense you would be building a temporary
graph-index I suppose.

    An advantage of organizing search result sets this way is that you
could then union and intersect result sets (and do other set
operations) without a huge memory overhead.  (Which means you could
probably store millions of search results at one time, and you could
persist them through restarts.)
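An in-memory toy of the page-graph idea from (3), to make the set operations concrete. In the real thing the pages would be actual graph nodes with relationships; here a list of lists stands in for the parent-node/page-node structure, and all the names are made up for illustration:

```python
def build_result_graph(result_ids, page_size):
    """Model of a result-set parent node with one node per page.
    Page order is list order; order within a page is element order."""
    pages = [result_ids[i:i + page_size]
             for i in range(0, len(result_ids), page_size)]
    return {"pages": pages}

def iter_results(graph):
    """Traverse the page graph in order -- the 'subsequent page
    retrievals are new traversals' idea."""
    for page in graph["pages"]:
        for rid in page:
            yield rid

def union_results(g1, g2, page_size):
    # dict.fromkeys deduplicates while preserving first-seen order
    ids = list(dict.fromkeys(list(iter_results(g1)) + list(iter_results(g2))))
    return build_result_graph(ids, page_size)

def intersect_results(g1, g2, page_size):
    s2 = set(iter_results(g2))
    ids = [r for r in iter_results(g1) if r in s2]
    return build_result_graph(ids, page_size)
```

The appeal of doing this with real graph nodes instead of lists is exactly what the paragraph says: the result sets live in the database, so the set operations never need the whole thing in client memory, and the pages survive a restart.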



4) In some HA architectures you may have multiple database copies behind a
load balancer.  Would the search result pages be stored equally on all of
them?  Would the client require a "sticky" flag, to always go back to the
same specific server instance for more pages?

   Depending on how fast writes get propagated across the cluster
(compared to requests for the next page), would creating nodes as
described in (3) still work?



5) As for sorting:

   In my experience, if I need a result set sorted from a regular SQL
database, I will usually sort it myself.  Most databases I've ever
worked with routinely have performance problems.  You can minimize
finger-pointing, and the risk of compounding those other performance
problems, by just having the database get me what I need; I'll do the
rest of it back in the client.

   On the other hand, sometimes it is quicker and easier to let the
database do the work. (Usually when I can only handle the data in small
chunks on the client.)

   What I'm trying to say is that I think sorting is going to be more
important to clients who want paginated results (i.e., resource-limited
clients) than to clients who are grabbing large chunks of data at a
time (and will want to "own" any post-query processing steps anyhow).


-- 
Rick Otten
rot...@windfish.net
O=='=+


_______________________________________________
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user
