No, GraphDatabaseService wisely hides those things away. I would suggest using instanceof and casting to EmbeddedGraphDatabase.
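A minimal sketch of the instanceof-and-cast approach, in case it helps. The two types below are stand-ins invented for this sketch so that it compiles without Neo4j on the classpath, and highNodeId() is a hypothetical accessor, not a real Neo4j method. In a plugin you would cast to the real EmbeddedGraphDatabase and then use one of the getConfig() chains quoted below.

```java
import java.util.OptionalLong;

// Stand-in types (assumption): they only mimic the shape of the real
// GraphDatabaseService / EmbeddedGraphDatabase relationship.
interface GraphDatabaseService { }

final class EmbeddedGraphDatabase implements GraphDatabaseService {
    private final long highNodeId;

    EmbeddedGraphDatabase(long highNodeId) {
        this.highNodeId = highNodeId;
    }

    // Hypothetical accessor standing in for the getConfig() call chain.
    long highNodeId() {
        return highNodeId;
    }
}

final class HighIdLookup {
    private HighIdLookup() { }

    // Returns the high node id when db is an embedded instance, empty otherwise.
    static OptionalLong highNodeId(GraphDatabaseService db) {
        if (db instanceof EmbeddedGraphDatabase) {
            EmbeddedGraphDatabase embedded = (EmbeddedGraphDatabase) db;
            return OptionalLong.of(embedded.highNodeId());
        }
        // Some other implementation: there is nothing safe to cast to.
        return OptionalLong.empty();
    }
}
```

If the server ever hands the plugin a different implementation, the caller gets an empty result and can fall back to something else instead of hitting a ClassCastException.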
cheers,
CG

2011/11/16 Anders Lindström <andli...@hotmail.com>:
>
> Chris, thanks again for your replies.
> I realize now that I don't have the 'getConfig' method -- I'm writing a
> server plugin and I only get the GraphDatabaseService interface passed to my
> method, not an EmbeddedGraphDatabase. Is there an equivalent way of getting
> the highest node index through the interface?
> Thanks.
>
>> Date: Thu, 10 Nov 2011 12:01:31 +0200
>> From: chris.gio...@neotechnology.com
>> To: user@lists.neo4j.org
>> Subject: Re: [Neo4j] Sampling a Neo4j instance?
>>
>> Answers inline.
>>
>> 2011/11/9 Anders Lindström <andli...@hotmail.com>:
>> >
>> > Thanks to the both of you. I am very grateful that you took the time to
>> > put this into code -- how's that for community!
>> > I presume this way of getting 'highId' is constant in time? It looks
>> > rather messy though -- is it really the most straightforward way to do it?
>>
>> This is the safest way to do it, that takes into consideration crashes
>> and HA cluster membership.
>>
>> Another way to do it is
>>
>> long highId = db.getConfig().getIdGeneratorFactory().get( IdType.NODE ).getHighId();
>>
>> which can return the same value as the first, if some conditions are
>> met. It is shorter and cast-free but I'd still use the first way.
>>
>> getHighId() is a constant time operation for both ways described - it
>> is just a field access, with an additional long comparison for the
>> first case.
>>
>> > I am thinking about how efficient this will be. As I understand it, the
>> > "sampling misses" come from deleted nodes that once were there. But if I
>> > remember correctly, Neo4j tries to reuse these unused node indices when
>> > new nodes are added. But is an unused node index _guaranteed_ to be used
>> > given that there is one, or could inserting another node result in
>> > increasing 'highId' even though some indices below it are not used?
>>
>> During the lifetime of a Neo4j instance there is no id reuse for Nodes
>> and Relationships - deleted ids are saved however and will be reused
>> the next time Neo4j starts. This means that if during run A you
>> deleted nodes 3 and 5, the first two nodes returned by createNode() on
>> the next run will have ids 3 and 5 - so highId will not change.
>> Additionally, during run A, after deleting nodes 3 and 5, no new nodes
>> would have the id 3 or 5. A crash (or improper shutdown) of the
>> database will break this however, since the ids-to-recycle will
>> probably not make it to disk.
>>
>> So, in short, it is guaranteed that ids *won't* be reused in the same
>> run but not guaranteed to be reused between runs.
>>
>> > My conclusion is that the "sampling misses" will increase with index usage
>> > sparseness and that we will have a high rate of "sampling misses" when we
>> > had many deletes and few insertions recently. Would you agree?
>>
>> Yes, that is true, especially given the cost of the "wasted" I/O and
>> of handling the exception. However, this cost can go down
>> significantly if you keep a hash set for the ids of nodes you have
>> deleted and check that before asking for the node by id, instead of
>> catching an exception. Persisting that between runs would move you
>> away from encapsulated Neo4j constructs and would also be more
>> efficient.
>>
>> > Thanks again.
>> > Regards, Anders
>> >
>> >> Date: Wed, 9 Nov 2011 19:30:36 +0200
>> >> From: chris.gio...@neotechnology.com
>> >> To: user@lists.neo4j.org
>> >> Subject: Re: [Neo4j] Sampling a Neo4j instance?
>> >>
>> >> Hi,
>> >>
>> >> Backing Jim's algorithm with some code:
>> >>
>> >> public static void main( String[] args )
>> >> {
>> >>     long SAMPLE_SIZE = 10000;
>> >>     EmbeddedGraphDatabase db = new EmbeddedGraphDatabase( "path/to/db/" );
>> >>     // Determine the highest possible id for the node store
>> >>     long highId = ( (NeoStoreXaDataSource) db.getConfig().getTxModule()
>> >>             .getXaDataSourceManager().getXaDataSource( Config.DEFAULT_DATA_SOURCE_NAME ) )
>> >>             .getNeoStore().getNodeStore().getHighId();
>> >>     System.out.println( highId + " is the highest id" );
>> >>     long i = 0;
>> >>     long nextId;
>> >>
>> >>     // Do the sampling
>> >>     Random random = new Random();
>> >>     while ( i < SAMPLE_SIZE )
>> >>     {
>> >>         // Unsigned shift keeps the value non-negative
>> >>         // ( Math.abs( Long.MIN_VALUE ) is still negative )
>> >>         nextId = ( random.nextLong() >>> 1 ) % highId;
>> >>         try
>> >>         {
>> >>             db.getNodeById( nextId );
>> >>             i++;
>> >>             System.out.println( "id " + nextId + " is there" );
>> >>         }
>> >>         catch ( NotFoundException e )
>> >>         {
>> >>             // NotFoundException is thrown when the node asked is not in use
>> >>             System.out.println( "id " + nextId + " not in use" );
>> >>         }
>> >>     }
>> >>     db.shutdown();
>> >> }
>> >>
>> >> As already mentioned, this will be slow. Random jumps around the
>> >> graph are not something caches can keep up with - unless your whole db
>> >> fits in memory. But accessing random pieces of an on-disk file cannot
>> >> be done much faster.
>> >>
>> >> cheers,
>> >> CG
>> >>
>> >> On Wed, Nov 9, 2011 at 6:08 PM, Jim Webber <j...@neotechnology.com> wrote:
>> >> > Hi Anders,
>> >> >
>> >> > When you do getAllNodes, you're getting back an iterable so as you
>> >> > point out the sample isn't random (unless it was written randomly to
>> >> > disk). If you're prepared to take a scattergun approach and tolerate
>> >> > being disk-bound, then you can ask for getNodeById using a made-up ID
>> >> > and deal with the times when your IDs don't resolve.
>> >> >
>> >> > It'll be slow (since the chances of having the nodes in cache are low)
>> >> > but as random as your random ID generator.
>> >> >
>> >> > Jim
>> >> > _______________________________________________
>> >> > Neo4j mailing list
>> >> > User@lists.neo4j.org
>> >> > https://lists.neo4j.org/mailman/listinfo/user
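
For reference, the hash-set idea from the quoted thread could be sketched roughly as follows. This is only an illustration under assumptions: NodeIdSampler and sampleLiveIds are hypothetical names, the set of deleted ids is assumed to be maintained by the application and to contain only ids below highId, and no Neo4j calls are made. Ids are drawn with replacement, and Math.floorMod (Java 8+) keeps every candidate in [0, highId) for any long input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.Set;

final class NodeIdSampler {
    private NodeIdSampler() { }

    // Draws sampleSize ids from [0, highId), skipping ids known to be deleted.
    static List<Long> sampleLiveIds(long highId, int sampleSize,
                                    Set<Long> deletedIds, Random random) {
        // Guard against looping forever when every id is known to be deleted.
        if (highId <= 0 || deletedIds.size() >= highId) {
            throw new IllegalArgumentException("no live ids to sample from");
        }
        List<Long> sampled = new ArrayList<>(sampleSize);
        while (sampled.size() < sampleSize) {
            // floorMod never returns a negative value for a positive modulus;
            // the slight modulo bias is ignored in this sketch.
            long candidate = Math.floorMod(random.nextLong(), highId);
            if (deletedIds.contains(candidate)) {
                continue; // known miss: skip it without touching the store
            }
            sampled.add(candidate);
        }
        return sampled;
    }
}
```

Each id returned would still need a getNodeById call. Ids that are missing from the set (for example after a crash) can still miss, so the NotFoundException handling stays as a fallback.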