Answers inline.

2011/11/9 Anders Lindström <[email protected]>:
>
> Thanks to the both of you. I am very grateful that you took your time to put 
> this into code -- how's that for community!
> I presume this way of getting 'highId' is constant in time? It looks rather 
> messy though -- is it really the most straightforward way to do it?

This is the safest way to do it, since it takes crashes and HA cluster
membership into account.

Another way to do it is

    long highId = db.getConfig().getIdGeneratorFactory().get( IdType.NODE ).getHighId();

which can return the same value as the first, if certain conditions are
met. It is shorter and cast-free, but I'd still use the first way.

getHighId() is a constant time operation for both ways described - it
is just a field access, with an additional long comparison for the
first case.

> I am thinking about how efficient this will be. As I understand it, the 
> "sampling misses" come from deleted nodes that once were there. But if I 
> remember correctly, Neo4j tries to reuse these unused node indices when new 
> nodes are added. But is an unused node index _guaranteed_ to be used given 
> that there is one, or could inserting another node result in increasing 
> 'highId' even though some indices below it are not used?

During the lifetime of a Neo4j instance there is no id reuse for Nodes
and Relationships. Deleted ids are saved, however, and will be reused
the next time Neo4j starts. This means that if during run A you
deleted nodes 3 and 5, the first two nodes returned by createNode() on
the next run will have ids 3 and 5, so highId will not change.
Additionally, during run A, after deleting nodes 3 and 5, no new nodes
would get the id 3 or 5. A crash (or improper shutdown) of the
database will break this, however, since the ids-to-recycle will
probably not make it to disk.

So, in short, ids are guaranteed *not* to be reused within the same
run, but they are not guaranteed to be reused between runs.
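
To make the rule above concrete, here is a toy model of the behaviour
described. It is only a sketch: ToyIdAllocator and all of its methods
are made up for illustration and are not Neo4j classes, and the real
id generator is more involved.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical toy model, NOT Neo4j code: it only mimics the rules described above.
public class ToyIdAllocator
{
    private long highId;
    // Ids recovered from the previous run; these may be handed out again.
    private final Deque<Long> reusable = new ArrayDeque<Long>();
    // Ids freed during the current run; unusable until a clean restart.
    private final Deque<Long> freedThisRun = new ArrayDeque<Long>();

    public long allocate()
    {
        return reusable.isEmpty() ? highId++ : reusable.pollFirst();
    }

    public void free( long id )
    {
        freedThisRun.addLast( id );
    }

    // A clean shutdown persists the freed ids; a crash loses them.
    public void restart( boolean cleanShutdown )
    {
        if ( cleanShutdown )
        {
            reusable.addAll( freedThisRun );
        }
        freedThisRun.clear();
    }

    public long getHighId()
    {
        return highId;
    }

    public static void main( String[] args )
    {
        ToyIdAllocator ids = new ToyIdAllocator();
        for ( int i = 0; i < 6; i++ )
        {
            ids.allocate(); // hands out ids 0..5
        }
        ids.free( 3 );
        ids.free( 5 );
        System.out.println( ids.allocate() ); // 6: no reuse within the same run
        ids.restart( true );
        System.out.println( ids.allocate() ); // 3: reused after a clean restart
        System.out.println( ids.allocate() ); // 5
        System.out.println( ids.getHighId() ); // 7: unchanged by the reuse
    }
}
```

Calling restart( false ) instead models the crash case: the freed ids
are forgotten and the next allocate() simply returns highId.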

> My conclusion is that the "sampling misses" will increase with index usage 
> sparseness and that we will have a high rate of "sampling misses" when we had 
> many deletes and few insertions recently. Would you agree?

Yes, that is true, especially given the cost of the "wasted" I/O and
of handling the exception. However, this cost can go down
significantly if you keep a hash set of the ids of nodes you have
deleted and check it before asking for the node by id, instead of
catching an exception. Persisting that set between runs would move you
away from relying on encapsulated Neo4j constructs and would also be
more efficient.
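
A minimal sketch of the hash set idea, kept free of Neo4j classes so
it runs on its own. DeletedIdSampler and sample() are names invented
for this example, and it assumes every id in deletedIds lies in
[0, highId) and that you maintain the set yourself whenever you delete
a node. In real code the marked line is where db.getNodeById() would
be called.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

// Hypothetical helper, not part of Neo4j.
public class DeletedIdSampler
{
    // Draws sampleSize ids uniformly from [0, highId), with replacement,
    // skipping ids that are known to have been deleted.
    public static List<Long> sample( long highId, Set<Long> deletedIds,
            int sampleSize, Random random )
    {
        // Assumes all deleted ids are below highId; otherwise this check is too strict.
        if ( highId - deletedIds.size() <= 0 )
        {
            throw new IllegalArgumentException( "no live ids to sample" );
        }
        List<Long> sampled = new ArrayList<Long>();
        while ( sampled.size() < sampleSize )
        {
            // Mask the sign bit so the remainder is never negative.
            long candidate = ( random.nextLong() & Long.MAX_VALUE ) % highId;
            if ( deletedIds.contains( candidate ) )
            {
                continue; // cheap in-memory miss: no I/O, no exception
            }
            // Real code would call db.getNodeById( candidate ) here.
            sampled.add( candidate );
        }
        return sampled;
    }

    public static void main( String[] args )
    {
        Set<Long> deleted = new HashSet<Long>();
        deleted.add( 3L );
        deleted.add( 5L );
        System.out.println( sample( 10, deleted, 5, new Random() ) );
    }
}
```

Ids that were never allocated or that were deleted before you started
tracking would still need the NotFoundException handling, so the set
reduces the misses rather than replacing the try/catch.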

> Thanks again.
> Regards,Anders
>
>> Date: Wed, 9 Nov 2011 19:30:36 +0200
>> From: [email protected]
>> To: [email protected]
>> Subject: Re: [Neo4j] Sampling a Neo4j instance?
>>
>> Hi,
>>
>> Backing Jim's algorithm with some code:
>>
>>     public static void main( String[] args )
>>     {
>>         long SAMPLE_SIZE = 10000;
>>         EmbeddedGraphDatabase db = new EmbeddedGraphDatabase(
>>                 "path/to/db/" );
>>         // Determine the highest possible id for the node store
>>         long highId = ( (NeoStoreXaDataSource) db.getConfig().getTxModule()
>>                 .getXaDataSourceManager().getXaDataSource(
>>                         Config.DEFAULT_DATA_SOURCE_NAME ) )
>>                 .getNeoStore().getNodeStore().getHighId();
>>         System.out.println( highId + " is the highest id" );
>>         long i = 0;
>>         long nextId;
>>
>>         // Do the sampling
>>         Random random = new Random();
>>         while ( i < SAMPLE_SIZE )
>>         {
>>             // Mask the sign bit: Math.abs( Long.MIN_VALUE ) is still negative
>>             nextId = ( random.nextLong() & Long.MAX_VALUE ) % highId;
>>             try
>>             {
>>                 db.getNodeById( nextId );
>>                 i++;
>>                 System.out.println( "id " + nextId + " is there" );
>>             }
>>             catch ( NotFoundException e )
>>             {
>>                 // NotFoundException is thrown when the requested node is not in use
>>                 System.out.println( "id " + nextId + " not in use" );
>>             }
>>         }
>>         db.shutdown();
>>     }
>>
>> Like already mentioned, this will be slow. Random jumps around the
>> graph are not something caches can keep up with - unless your whole db
>> fits in memory. But accessing random pieces of an on-disk file cannot
>> be done much faster.
>>
>> cheers,
>> CG
>>
>> On Wed, Nov 9, 2011 at 6:08 PM, Jim Webber <[email protected]> wrote:
>> > Hi Anders,
>> >
>> > When you do getAllNodes, you're getting back an iterable so as you point 
>> > out the sample isn't random (unless it was written randomly to disk). If 
>> > you're prepared to take a scattergun approach and tolerate being 
>> > disk-bound, then you can ask for getNodeById using a made-up ID and deal 
>> > with the times when your IDs don't resolve.
>> >
>> > It'll be slow (since the chances of having the nodes in cache are low) but 
>> > as random as your random ID generator.
>> >
>> > Jim
>> > _______________________________________________
>> > Neo4j mailing list
>> > [email protected]
>> > https://lists.neo4j.org/mailman/listinfo/user