Re: [Neo4j] Sampling a Neo4j instance?

Michael Hunger Thu, 10 Nov 2011 02:14:48 -0800

You could probably use an index for your nodes (it could be an auto-index)

and then randomly shuffle the results. You can pass a Lucene query object
or a query string to index.query(queryOrQueryObject).


Something like this:
http://stackoverflow.com/questions/7201638/lucene-2-9-2-how-to-show-results-in-random-order

Perhaps there is also some string-based Lucene query/sort syntax for it.
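
A minimal sketch of the shuffle step, assuming the index hits can be
treated as a plain Iterable. It uses only java.util, so the class and
method names (ShuffleSample, sample) are made up for illustration and are
not part of Neo4j or Lucene. Note that it copies every hit into memory
before shuffling, which may not be acceptable for very large result sets.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not a Neo4j class.
class ShuffleSample
{
    // Copies all hits into a list, shuffles it uniformly, and returns
    // the first sampleSize entries (or fewer if there are not enough hits).
    static <T> List<T> sample( Iterable<T> hits, int sampleSize, Random random )
    {
        List<T> all = new ArrayList<T>();
        for ( T hit : hits )
        {
            all.add( hit );
        }
        Collections.shuffle( all, random );
        int size = Math.min( sampleSize, all.size() );
        return new ArrayList<T>( all.subList( 0, size ) );
    }

    public static void main( String[] args )
    {
        // Stand-in for the result of an index query: ids 0..99
        List<Long> ids = new ArrayList<Long>();
        for ( long id = 0; id < 100; id++ )
        {
            ids.add( id );
        }
        List<Long> picked = sample( ids, 10, new Random() );
        System.out.println( picked.size() + " ids sampled: " + picked );
    }
}
```

In real use the first argument would be whatever index.query(...) returns,
and the hits should be closed once they have been copied.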

Michael

On 10.11.2011 at 11:01, Chris Gioran wrote:

> Answers inline.
> 
> 2011/11/9 Anders Lindström <andli...@hotmail.com>:
>> 
>> Thanks to both of you. I am very grateful that you took the time to put 
>> this into code -- how's that for community!
>> I presume this way of getting 'highId' is constant in time? It looks rather 
>> messy though -- is it really the most straightforward way to do it?
> 
> This is the safest way to do it, that takes into consideration crashes
> and HA cluster membership.
> 
> Another way to do it is
> 
> long highId = db.getConfig().getIdGeneratorFactory().get( IdType.NODE
> ).getHighId();
> 
> which can return the same value as the first, if some conditions are
> met. It is shorter and cast-free, but I'd still use the first way.
> 
> getHighId() is a constant time operation for both ways described - it
> is just a field access, with an additional long comparison for the
> first case.
> 
>> I am thinking about how efficient this will be. As I understand it, the 
>> "sampling misses" come from deleted nodes that once were there. But if I 
>> remember correctly, Neo4j tries to reuse these unused node indices when new 
>> nodes are added. But is an unused node index _guaranteed_ to be used given 
>> that there is one, or could inserting another node result in increasing 
>> 'highId' even though some indices below it are not used?
> 
> During the lifetime of a Neo4j instance there is no id reuse for Nodes
> and Relationships - deleted ids are saved however and will be reused
> the next time Neo4j starts. This means that if during run A you
> deleted nodes 3 and 5, the first two nodes returned by createNode() on
> the next run will have ids 3 and 5 - so highId will not change.
> Additionally, during run A, after deleting nodes 3 and 5, no new nodes
> would have the id 3 or 5. A crash (or improper shutdown) of the
> database will break this however, since the ids-to-recycle will
> probably not make it to disk.
> 
> So, in short, it is guaranteed that ids *won't* be reused in the same
> run but not guaranteed to be reused between runs.
> 
>> My conclusion is that the "sampling misses" will increase with index usage 
>> sparseness and that we will have a high rate of "sampling misses" when we 
>> had many deletes and few insertions recently. Would you agree?
> 
> Yes, that is true, especially given the cost of the "wasted" I/O and
> of handling the exception. However, this cost can go down
> significantly if you keep a hash set for the ids of nodes you have
> deleted and check that before asking for the node by id, instead of
> catching an exception. Persisting that between runs would move you
> away from encapsulated Neo4j constructs and would also be more
> efficient.
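
A minimal sketch of the hash-set idea above, assuming the application
records every node id it deletes. DeletedIdFilter and its methods are
hypothetical names, not Neo4j API. The caller would still need to catch
NotFoundException for ids that were deleted before tracking started.

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Hypothetical helper, not a Neo4j class.
class DeletedIdFilter
{
    private final Set<Long> deletedIds = new HashSet<Long>();

    // Call this whenever the application deletes a node.
    void markDeleted( long id )
    {
        deletedIds.add( id );
    }

    boolean isKnownDeleted( long id )
    {
        return deletedIds.contains( id );
    }

    // Draws random ids in [0, highId) until one is found that is not
    // known to be deleted. Assumption: at least one id below highId has
    // not been marked deleted, otherwise this loop never terminates.
    long nextCandidate( Random random, long highId )
    {
        if ( highId <= 0 )
        {
            throw new IllegalArgumentException( "highId must be positive" );
        }
        long candidate;
        do
        {
            // Clear the sign bit so the remainder is never negative
            candidate = ( random.nextLong() & Long.MAX_VALUE ) % highId;
        }
        while ( deletedIds.contains( candidate ) );
        return candidate;
    }
}
```

The set lookup is an in-memory check, so a known-deleted id costs neither
a store access nor an exception.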
> 
>> Thanks again.
>> Regards, Anders
>> 
>>> Date: Wed, 9 Nov 2011 19:30:36 +0200
>>> From: chris.gio...@neotechnology.com
>>> To: user@lists.neo4j.org
>>> Subject: Re: [Neo4j] Sampling a Neo4j instance?
>>> 
>>> Hi,
>>> 
>>> Backing Jim's algorithm with some code:
>>> 
>>>     public static void main( String[] args )
>>>     {
>>>         long SAMPLE_SIZE = 10000;
>>>         EmbeddedGraphDatabase db = new EmbeddedGraphDatabase(
>>>                 "path/to/db/" );
>>>         // Determine the highest possible id for the node store
>>>         long highId = ( (NeoStoreXaDataSource) db.getConfig().getTxModule()
>>>                 .getXaDataSourceManager().getXaDataSource(
>>>                         Config.DEFAULT_DATA_SOURCE_NAME ) )
>>>                 .getNeoStore().getNodeStore().getHighId();
>>>         System.out.println( highId + " is the highest id" );
>>>         long i = 0;
>>>         long nextId;
>>> 
>>>         // Do the sampling
>>>         Random random = new Random();
>>>         while ( i < SAMPLE_SIZE )
>>>         {
>>>             // Mask the sign bit: Math.abs( Long.MIN_VALUE ) is still negative
>>>             nextId = ( random.nextLong() & Long.MAX_VALUE ) % highId;
>>>             try
>>>             {
>>>                 db.getNodeById( nextId );
>>>                 i++;
>>>                 System.out.println( "id " + nextId + " is there" );
>>>             }
>>>             catch ( NotFoundException e )
>>>             {
>>>                 // NotFoundException is thrown when the requested node is not in use
>>>                 System.out.println( "id " + nextId + " not in use" );
>>>             }
>>>         }
>>>         db.shutdown();
>>>     }
>>> 
>>> Like already mentioned, this will be slow. Random jumps around the
>>> graph are not something caches can keep up with - unless your whole db
>>> fits in memory. But accessing random pieces of an on-disk file cannot
>>> be done much faster.
>>> 
>>> cheers,
>>> CG
>>> 
>>> On Wed, Nov 9, 2011 at 6:08 PM, Jim Webber <j...@neotechnology.com> wrote:
>>>> Hi Anders,
>>>> 
>>>> When you do getAllNodes, you're getting back an iterable so as you point 
>>>> out the sample isn't random (unless it was written randomly to disk). If 
>>>> you're prepared to take a scattergun approach and tolerate being 
>>>> disk-bound, then you can ask for getNodeById using a made-up ID and deal 
>>>> with the times when your IDs don't resolve.
>>>> 
>>>> It'll be slow (since the chances of having the nodes in cache are low) but 
>>>> as random as your random ID generator.
>>>> 
>>>> Jim
>>>> _______________________________________________
>>>> Neo4j mailing list
>>>> User@lists.neo4j.org
>>>> https://lists.neo4j.org/mailman/listinfo/user
>>>> 
