Probably by using an index for your nodes (could be an auto-index) and then randomly shuffling the results? You can pass a Lucene query object or a query string to index.query(queryOrQueryObject).
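A minimal sketch of the shuffling step, assuming the ids of the index hits have already been copied into a plain list. ShuffledSample and sample() are made-up names for illustration, and the Neo4j index call itself is left out:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not part of Neo4j: picks a random sample from ids
// that were collected from an index query beforehand.
class ShuffledSample
{
    static List<Long> sample( List<Long> hitIds, int sampleSize, Random random )
    {
        // Copy first so the caller's list keeps its original order
        List<Long> shuffled = new ArrayList<Long>( hitIds );
        Collections.shuffle( shuffled, random );
        // Never return more ids than there were hits
        int n = Math.min( sampleSize, shuffled.size() );
        return new ArrayList<Long>( shuffled.subList( 0, n ) );
    }

    public static void main( String[] args )
    {
        // Stand-in for ids gathered from the index hits
        List<Long> hitIds = new ArrayList<Long>();
        for ( long id = 0; id < 20; id++ )
        {
            hitIds.add( id );
        }
        System.out.println( sample( hitIds, 5, new Random() ) );
    }
}
```

This samples without replacement, but it needs all hits in memory, so it only makes sense when the query result is reasonably small.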
For the random ordering, something like this might work:
http://stackoverflow.com/questions/7201638/lucene-2-9-2-how-to-show-results-in-random-order
Perhaps there is also some string-based Lucene query/sort syntax for it.

Michael

On 10.11.2011 at 11:01, Chris Gioran wrote:

> Answers inline.
>
> 2011/11/9 Anders Lindström <andli...@hotmail.com>:
>>
>> Thanks to the both of you. I am very grateful that you took your time to put
>> this into code -- how's that for community!
>> I presume this way of getting 'highId' is constant in time? It looks rather
>> messy though -- is it really the most straightforward way to do it?
>
> This is the safest way to do it, that takes into consideration crashes
> and HA cluster membership.
>
> Another way to do it is
>
> long highId = db.getConfig().getIdGeneratorFactory().get( IdType.NODE ).getHighId();
>
> which can return the same value as the first, if some conditions are
> met. It is shorter and cast-free but I'd still use the first way.
>
> getHighId() is a constant time operation for both ways described - it
> is just a field access, with an additional long comparison for the
> first case.
>
>> I am thinking about how efficient this will be. As I understand it, the
>> "sampling misses" come from deleted nodes that were once there. But if I
>> remember correctly, Neo4j tries to reuse these unused node indices when new
>> nodes are added. But is an unused node index _guaranteed_ to be used given
>> that there is one, or could inserting another node result in increasing
>> 'highId' even though some indices below it are not used?
>
> During the lifetime of a Neo4j instance there is no id reuse for Nodes
> and Relationships - deleted ids are saved however and will be reused
> the next time Neo4j starts. This means that if during run A you
> deleted nodes 3 and 5, the first two nodes returned by createNode() on
> the next run will have ids 3 and 5 - so highId will not change.
> Additionally, during run A, after deleting nodes 3 and 5, no new nodes
> would have the id 3 or 5. A crash (or improper shutdown) of the
> database will break this however, since the ids-to-recycle will
> probably not make it to disk.
>
> So, in short, it is guaranteed that ids *won't* be reused in the same
> run but not guaranteed to be reused between runs.
>
>> My conclusion is that the "sampling misses" will increase with index usage
>> sparseness and that we will have a high rate of "sampling misses" when we
>> had many deletes and few insertions recently. Would you agree?
>
> Yes, that is true, especially given the cost of the "wasted" I/O and
> of handling the exception. However, this cost can go down
> significantly if you keep a hash set for the ids of nodes you have
> deleted and check that before asking for the node by id, instead of
> catching an exception. Persisting that between runs would move you
> away from encapsulated Neo4j constructs and would also be more
> efficient.
>
>> Thanks again.
>> Regards, Anders
>>
>>> Date: Wed, 9 Nov 2011 19:30:36 +0200
>>> From: chris.gio...@neotechnology.com
>>> To: user@lists.neo4j.org
>>> Subject: Re: [Neo4j] Sampling a Neo4j instance?
>>>
>>> Hi,
>>>
>>> Backing Jim's algorithm with some code:
>>>
>>> public static void main( String[] args )
>>> {
>>>     long SAMPLE_SIZE = 10000;
>>>     EmbeddedGraphDatabase db = new EmbeddedGraphDatabase( "path/to/db/" );
>>>     // Determine the highest possible id for the node store
>>>     long highId = ( (NeoStoreXaDataSource) db.getConfig().getTxModule()
>>>             .getXaDataSourceManager().getXaDataSource(
>>>                     Config.DEFAULT_DATA_SOURCE_NAME ) )
>>>             .getNeoStore().getNodeStore().getHighId();
>>>     System.out.println( highId + " is the highest id" );
>>>     long i = 0;
>>>     long nextId;
>>>
>>>     // Do the sampling
>>>     Random random = new Random();
>>>     while ( i < SAMPLE_SIZE )
>>>     {
>>>         // Shift out the sign bit instead of using Math.abs(), which
>>>         // still returns a negative value for Long.MIN_VALUE
>>>         nextId = ( random.nextLong() >>> 1 ) % highId;
>>>         try
>>>         {
>>>             db.getNodeById( nextId );
>>>             i++;
>>>             System.out.println( "id " + nextId + " is there" );
>>>         }
>>>         catch ( NotFoundException e )
>>>         {
>>>             // NotFoundException is thrown when the node asked is not in use
>>>             System.out.println( "id " + nextId + " not in use" );
>>>         }
>>>     }
>>>     db.shutdown();
>>> }
>>>
>>> Like already mentioned, this will be slow. Random jumps around the
>>> graph are not something caches can keep up with - unless your whole db
>>> fits in memory. But accessing random pieces of an on-disk file cannot
>>> be done much faster.
>>>
>>> cheers,
>>> CG
>>>
>>> On Wed, Nov 9, 2011 at 6:08 PM, Jim Webber <j...@neotechnology.com> wrote:
>>>> Hi Anders,
>>>>
>>>> When you do getAllNodes, you're getting back an iterable so as you point
>>>> out the sample isn't random (unless it was written randomly to disk). If
>>>> you're prepared to take a scattergun approach and tolerate being
>>>> disk-bound, then you can ask for getNodeById using a made-up ID and deal
>>>> with the times when your IDs don't resolve.
>>>>
>>>> It'll be slow (since the chances of having the nodes in cache are low) but
>>>> as random as your random ID generator.
>>>>
>>>> Jim
>>>> _______________________________________________
>>>> Neo4j mailing list
>>>> User@lists.neo4j.org
>>>> https://lists.neo4j.org/mailman/listinfo/user
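
To illustrate the hash-set check Chris describes in his reply, here is a rough sketch. It assumes you maintain the set of deleted ids yourself and that it only holds ids below highId. DeletedIdSampler and sampleIds() are hypothetical names, no Neo4j calls are made, and highId would come from one of the calls quoted above:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

// Hypothetical helper, not part of Neo4j.
class DeletedIdSampler
{
    // Draws sampleSize ids uniformly from [0, highId), skipping ids that are
    // known to be deleted. Like the loop in the quoted code, this samples
    // with replacement, so the same id can come up more than once.
    static List<Long> sampleIds( long highId, Set<Long> deletedIds, int sampleSize, Random random )
    {
        // Assumption: deletedIds only contains ids in [0, highId)
        if ( highId <= 0 || deletedIds.size() >= highId )
        {
            throw new IllegalArgumentException( "no live ids to sample from" );
        }
        List<Long> sample = new ArrayList<Long>( sampleSize );
        while ( sample.size() < sampleSize )
        {
            // Shift out the sign bit to get a non-negative candidate
            long candidate = ( random.nextLong() >>> 1 ) % highId;
            // Cheap in-memory check replaces the failed lookup and exception
            if ( !deletedIds.contains( candidate ) )
            {
                sample.add( candidate );
            }
        }
        return sample;
    }

    public static void main( String[] args )
    {
        Set<Long> deleted = new HashSet<Long>();
        deleted.add( 3L );
        deleted.add( 5L );
        System.out.println( sampleIds( 10, deleted, 5, new Random() ) );
    }
}
```

Each returned id would then be passed to getNodeById(). Ids that were never recorded in the set (for example after a crash) can still miss, so the NotFoundException handling is still needed as a fallback.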