Re: [Neo4j] Sampling a Neo4j instance?

Chris Gioran Wed, 16 Nov 2011 11:40:45 -0800

No, GraphDatabaseService wisely hides those things away. I would
suggest using instanceof and casting to EmbeddedGraphDatabase.


cheers,
CG

2011/11/16 Anders Lindström <[email protected]>:
>
> Chris, thanks again for your replies.
> I realize now that I don't have the 'getConfig' method -- I'm writing a
> server plugin and I only get the GraphDatabaseService interface passed to my
> method, not an EmbeddedGraphDatabase. Is there an equivalent way of getting
> the highest node index through the interface?
> Thanks.
>
>> Date: Thu, 10 Nov 2011 12:01:31 +0200
>> From: [email protected]
>> To: [email protected]
>> Subject: Re: [Neo4j] Sampling a Neo4j instance?
>>
>> Answers inline.
>>
>> 2011/11/9 Anders Lindström <[email protected]>:
>> >
>> > Thanks to the both of you. I am very grateful that you took your time to 
>> > put this into code -- how's that for community!
>> > I presume this way of getting 'highId' is constant in time? It looks 
>> > rather messy though -- is it really the most straightforward way to do it?
>>
>> This is the safest way to do it, since it takes crashes and HA
>> cluster membership into account.
>>
>> Another way to do it is
>>
>> long highId = db.getConfig().getIdGeneratorFactory().get( IdType.NODE
>> ).getHighId();
>>
>> which returns the same value as the first way, provided some
>> conditions are met. It is shorter and cast-free, but I'd still use
>> the first way.
>>
>> getHighId() is a constant time operation for both ways described - it
>> is just a field access, with an additional long comparison for the
>> first case.
>>
>> > I am thinking about how efficient this will be. As I understand it, the
>> > "sampling misses" come from deleted nodes that once were there. But if I
>> > remember correctly, Neo4j tries to reuse these unused node indices when
>> > new nodes are added. But is an unused node index _guaranteed_ to be used
>> > given that there is one, or could inserting another node result in
>> > increasing 'highId' even though some indices below it are not used?
>>
>> During the lifetime of a Neo4j instance there is no id reuse for Nodes
>> and Relationships - deleted ids are saved however and will be reused
>> the next time Neo4j starts. This means that if during run A you
>> deleted nodes 3 and 5, the first two nodes returned by createNode() on
>> the next run will have ids 3 and 5 - so highId will not change.
>> Additionally, during run A, after deleting nodes 3 and 5, no new nodes
>> would have the id 3 or 5. A crash (or improper shutdown) of the
>> database will break this however, since the ids-to-recycle will
>> probably not make it to disk.
>>
>> So, in short, it is guaranteed that ids *won't* be reused in the same
>> run but not guaranteed to be reused between runs.
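The reuse rule described above can be illustrated with a toy model. This is only a sketch of the behaviour as stated in this message, not Neo4j's actual id generator; the class ToyIdAllocator and all of its methods are made up for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model only: ids freed during a run become reusable after a clean
// restart, and are lost if the instance crashes. Not Neo4j code.
class ToyIdAllocator {
    private long highId = 0;
    // Ids that were handed back by a previous, cleanly shut down run.
    private final Deque<Long> reusable = new ArrayDeque<>();
    // Ids freed during the current run; never reused within the same run.
    private final Deque<Long> freedThisRun = new ArrayDeque<>();

    long allocate() {
        Long recycled = reusable.pollFirst();
        return recycled != null ? recycled : highId++;
    }

    void free(long id) {
        freedThisRun.addLast(id);
    }

    // A clean shutdown persists the freed ids; a crash drops them.
    void restart(boolean cleanShutdown) {
        if (cleanShutdown) {
            reusable.addAll(freedThisRun);
        }
        freedThisRun.clear();
    }

    long getHighId() {
        return highId;
    }

    public static void main(String[] args) {
        ToyIdAllocator ids = new ToyIdAllocator();
        for (int i = 0; i < 6; i++) {
            ids.allocate();          // run A creates ids 0..5
        }
        ids.free(3);
        ids.free(5);
        ids.restart(true);           // clean shutdown, then run B
        System.out.println(ids.allocate() + ", " + ids.allocate()
                + ", highId = " + ids.getHighId());
    }
}
```

In this model a clean restart hands out 3 and 5 before highId grows, while restart(false) leaves those two ids as permanent holes, which is where the extra sampling misses come from.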
>>
>> > My conclusion is that the "sampling misses" will increase with index usage 
>> > sparseness and that we will have a high rate of "sampling misses" when we 
>> > had many deletes and few insertions recently. Would you agree?
>>
>> Yes, that is true, especially given the cost of the "wasted" I/O and
>> of handling the exception. However, this cost can go down
>> significantly if you keep a hash set for the ids of nodes you have
>> deleted and check that before asking for the node by id, instead of
>> catching an exception. Persisting that set between runs would mean
>> you no longer depend on Neo4j's internal id bookkeeping, and would
>> also be more efficient.
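A minimal sketch of the hash-set idea, under stated assumptions: the class DeletedIdSampler and its parameters are hypothetical, and the existsInStore predicate merely stands in for a db.getNodeById call wrapped in a try/catch for NotFoundException. The application is assumed to add every id it deletes to knownDeleted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;
import java.util.function.LongPredicate;

// Sketch only: skips ids known to be deleted before touching the store.
class DeletedIdSampler {
    /**
     * Collects sampleSize ids (with replacement) from [0, highId).
     * Loops forever if no id in that range is live, so callers should
     * make sure at least one node exists.
     */
    static List<Long> sample(long highId, int sampleSize,
            Set<Long> knownDeleted, LongPredicate existsInStore,
            Random random) {
        if (highId <= 0) {
            throw new IllegalArgumentException("highId must be positive");
        }
        List<Long> hits = new ArrayList<>();
        while (hits.size() < sampleSize) {
            // floorMod never returns a negative value for a positive divisor
            long candidate = Math.floorMod(random.nextLong(), highId);
            if (knownDeleted.contains(candidate)) {
                continue; // in-memory miss: no I/O, no exception
            }
            if (existsInStore.test(candidate)) {
                hits.add(candidate);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Set<Long> deleted = new HashSet<>(Arrays.asList(3L, 5L));
        // Pretend store in which ids 0..9 exist except the deleted ones.
        List<Long> hits = sample(10, 5, deleted,
                id -> !deleted.contains(id), new Random());
        System.out.println(hits);
    }
}
```

The set only helps for deletions the application performed itself; holes left by earlier runs or by a crash still cost a store lookup.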
>>
>> > Thanks again.
>> > Regards,Anders
>> >
>> >> Date: Wed, 9 Nov 2011 19:30:36 +0200
>> >> From: [email protected]
>> >> To: [email protected]
>> >> Subject: Re: [Neo4j] Sampling a Neo4j instance?
>> >>
>> >> Hi,
>> >>
>> >> Backing Jim's algorithm with some code:
>> >>
>> >>     public static void main( String[] args )
>> >>     {
>> >>         long SAMPLE_SIZE = 10000;
>> >>         EmbeddedGraphDatabase db = new EmbeddedGraphDatabase(
>> >>                 "path/to/db/" );
>> >>         // Determine the highest possible id for the node store
>> >>         long highId = ( (NeoStoreXaDataSource) db.getConfig().getTxModule()
>> >>                 .getXaDataSourceManager().getXaDataSource(
>> >>                         Config.DEFAULT_DATA_SOURCE_NAME ) )
>> >>                 .getNeoStore().getNodeStore().getHighId();
>> >>         System.out.println( highId + " is the highest id" );
>> >>         long i = 0;
>> >>         long nextId;
>> >>
>> >>         // Do the sampling
>> >>         Random random = new Random();
>> >>         while ( i < SAMPLE_SIZE )
>> >>         {
>> >>             nextId = Math.abs( random.nextLong() ) % highId;
>> >>             try
>> >>             {
>> >>                 db.getNodeById( nextId );
>> >>                 i++;
>> >>                 System.out.println( "id " + nextId + " is there" );
>> >>             }
>> >>             catch ( NotFoundException e )
>> >>             {
>> >>                 // NotFoundException is thrown when the requested node
>> >>                 // is not in use
>> >>                 System.out.println( "id " + nextId + " not in use" );
>> >>             }
>> >>         }
>> >>         db.shutdown();
>> >>     }
>> >>
>> >> As already mentioned, this will be slow. Random jumps around the
>> >> graph are not something caches can keep up with - unless your whole db
>> >> fits in memory. But accessing random pieces of an on-disk file cannot
>> >> be done much faster.
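One small caveat about the id selection in the loop above: Math.abs( random.nextLong() ) returns a negative number when nextLong() happens to produce Long.MIN_VALUE, so the remainder can be negative in that (very rare) case, and taking a remainder also slightly favours low ids. A possible alternative, sketched here with a hypothetical helper class RandomIdPicker, is to let the JDK draw a bounded value directly:

```java
import java.util.concurrent.ThreadLocalRandom;

// Sketch only: draws an id in [0, highId) without Math.abs or a remainder.
class RandomIdPicker {
    static long nextId(long highId) {
        if (highId <= 0) {
            throw new IllegalArgumentException("highId must be positive");
        }
        // nextLong(bound) returns a value in [0, bound)
        return ThreadLocalRandom.current().nextLong(highId);
    }

    public static void main(String[] args) {
        // Long.MIN_VALUE has no positive counterpart, so abs() leaves it as is
        System.out.println(Math.abs(Long.MIN_VALUE) < 0); // prints "true"
        System.out.println(nextId(1000));
    }
}
```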
>> >>
>> >> cheers,
>> >> CG
>> >>
>> >> On Wed, Nov 9, 2011 at 6:08 PM, Jim Webber <[email protected]> wrote:
>> >> > Hi Anders,
>> >> >
>> >> > When you do getAllNodes, you're getting back an iterable so as you 
>> >> > point out the sample isn't random (unless it was written randomly to 
>> >> > disk). If you're prepared to take a scattergun approach and tolerate 
>> >> > being disk-bound, then you can ask for getNodeById using a made-up ID 
>> >> > and deal with the times when your IDs don't resolve.
>> >> >
>> >> > It'll be slow (since the chances of having the nodes in cache are low) 
>> >> > but as random as your random ID generator.
>> >> >
>> >> > Jim
>> >> > _______________________________________________
>> >> > Neo4j mailing list
>> >> > [email protected]
>> >> > https://lists.neo4j.org/mailman/listinfo/user
