Re: [Neo4j] Sampling a Neo4j instance?

Chris Gioran Wed, 09 Nov 2011 09:30:48 -0800

Hi,

Backing Jim's algorithm with some code:


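    // Imports assume the Neo4j 1.x kernel package layout.
    import java.util.Random;

    import org.neo4j.graphdb.NotFoundException;
    import org.neo4j.kernel.Config;
    import org.neo4j.kernel.EmbeddedGraphDatabase;
    import org.neo4j.kernel.impl.nioneo.xa.NeoStoreXaDataSource;

    public static void main( String[] args )
    {
        long SAMPLE_SIZE = 10000;
        EmbeddedGraphDatabase db = new EmbeddedGraphDatabase( "path/to/db/" );
        try
        {
            // Determine the highest possible id for the node store
            NeoStoreXaDataSource dataSource = (NeoStoreXaDataSource) db.getConfig()
                    .getTxModule()
                    .getXaDataSourceManager()
                    .getXaDataSource( Config.DEFAULT_DATA_SOURCE_NAME );
            long highId = dataSource.getNeoStore().getNodeStore().getHighId();
            System.out.println( highId + " is the highest id" );
            long i = 0;
            long nextId;

            // Do the sampling (with replacement). Note that the loop never
            // ends if the store contains no nodes in use.
            Random random = new Random();
            while ( i < SAMPLE_SIZE )
            {
                // Mask the sign bit instead of calling Math.abs(), which
                // returns a negative value for Long.MIN_VALUE
                nextId = ( random.nextLong() & Long.MAX_VALUE ) % highId;
                try
                {
                    db.getNodeById( nextId );
                    i++;
                    System.out.println( "id " + nextId + " is there" );
                }
                catch ( NotFoundException e )
                {
                    // NotFoundException is thrown when the requested node is not in use
                    System.out.println( "id " + nextId + " not in use" );
                }
            }
        }
        finally
        {
            // Shut down cleanly even if the sampling fails
            db.shutdown();
        }
    }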

As already mentioned, this will be slow. Random jumps around the
graph are not something caches can keep up with, unless your whole db
fits in memory. But random reads from an on-disk file cannot be made
much faster.
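
One side note on the id draw in the loop above: reducing a random long
modulo highId slightly favours the low ids whenever highId does not
divide 2^63 evenly. For realistic store sizes the skew is tiny, but if
it matters, rejection sampling is one way around it. A minimal sketch,
assuming nothing beyond java.util.Random; the class name
UniformIdSampler and the method nextId are made up for illustration and
are not part of Neo4j:

```java
import java.util.Random;

// Hypothetical helper, not part of Neo4j: draws ids uniformly from [0, bound).
final class UniformIdSampler
{
    private UniformIdSampler()
    {
    }

    static long nextId( Random random, long bound )
    {
        if ( bound <= 0 )
        {
            throw new IllegalArgumentException( "bound must be positive" );
        }
        long bits;
        long candidate;
        do
        {
            // Drop the sign bit: bits is uniform in [0, 2^63)
            bits = random.nextLong() >>> 1;
            candidate = bits % bound;
            // (bits - candidate) is the start of the block of size 'bound'
            // that bits fell into. If the end of that block overflows past
            // Long.MAX_VALUE the block is incomplete, so draw again.
        }
        while ( bits - candidate + ( bound - 1 ) < 0 );
        return candidate;
    }

    public static void main( String[] args )
    {
        Random random = new Random();
        long highId = 1000; // stand-in for the value read from the node store
        for ( int i = 0; i < 5; i++ )
        {
            System.out.println( "candidate id " + nextId( random, highId ) );
        }
    }
}
```

In the loop above, the draw would then become
nextId = UniformIdSampler.nextId( random, highId ). This does nothing
for the speed; the disk seeks still dominate.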

cheers,
CG

On Wed, Nov 9, 2011 at 6:08 PM, Jim Webber <[email protected]> wrote:
> Hi Anders,
>
> When you do getAllNodes, you're getting back an iterable so as you point out 
> the sample isn't random (unless it was written randomly to disk). If you're 
> prepared to take a scattergun approach and tolerate being disk-bound, then 
> you can ask for getNodeById using a made-up ID and deal with the times when 
> your IDs don't resolve.
>
> It'll be slow (since the chances of having the nodes in cache are low) but as 
> random as your random ID generator.
>
> Jim
> _______________________________________________
> Neo4j mailing list
> [email protected]
> https://lists.neo4j.org/mailman/listinfo/user
>