Hi,
Backing Jim's algorithm with some code:
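// Package names are those used by the Neo4j 1.x releases; adjust for your version.
import java.util.Random;

import org.neo4j.graphdb.NotFoundException;
import org.neo4j.kernel.Config;
import org.neo4j.kernel.EmbeddedGraphDatabase;
import org.neo4j.kernel.impl.nioneo.xa.NeoStoreXaDataSource;

public class RandomNodeSample
{
    public static void main( String[] args )
    {
        long SAMPLE_SIZE = 10000;
        EmbeddedGraphDatabase db = new EmbeddedGraphDatabase( "path/to/db/" );

        // Determine the highest possible id for the node store
        long highId = ( (NeoStoreXaDataSource) db.getConfig().getTxModule()
                .getXaDataSourceManager().getXaDataSource(
                        Config.DEFAULT_DATA_SOURCE_NAME ) )
                .getNeoStore().getNodeStore().getHighId();
        System.out.println( highId + " is the highest id" );

        // Do the sampling (with replacement: the same id may be drawn twice)
        Random random = new Random();
        long i = 0;
        while ( i < SAMPLE_SIZE )
        {
            // Mask off the sign bit instead of calling Math.abs(), which
            // returns a negative value for Long.MIN_VALUE
            long nextId = ( random.nextLong() & Long.MAX_VALUE ) % highId;
            try
            {
                db.getNodeById( nextId );
                i++;
                System.out.println( "id " + nextId + " is there" );
            }
            catch ( NotFoundException e )
            {
                // NotFoundException is thrown when the requested id is not in use
                System.out.println( "id " + nextId + " not in use" );
            }
        }
        db.shutdown();
    }
}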
As already mentioned, this will be slow. Random jumps around the
graph are not something caches can keep up with, unless your whole db
fits in memory. But random access to an on-disk file cannot be made
much faster.
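If you want to try the retry loop without a database at hand, here is a
rough, Neo4j-free sketch of the same idea. Everything in it is made up
for illustration: RejectionSampler, the inUse predicate (standing in for
getNodeById() plus the NotFoundException check) and the maxAttempts cap
(which I added so an empty or very sparse id space cannot loop forever)
are not part of any Neo4j API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.LongPredicate;

// Hypothetical helper, not part of Neo4j: rejection sampling over a sparse id space.
class RejectionSampler
{
    /**
     * Draws up to sampleSize ids uniformly, with replacement, from the ids in
     * [0, highId) that the inUse predicate accepts. Gives up after maxAttempts
     * draws, so the returned list may be shorter than sampleSize.
     */
    static List<Long> sample( long highId, int sampleSize, long maxAttempts,
            LongPredicate inUse, Random random )
    {
        if ( highId <= 0 )
        {
            throw new IllegalArgumentException( "highId must be positive" );
        }
        List<Long> hits = new ArrayList<>();
        for ( long attempt = 0; attempt < maxAttempts && hits.size() < sampleSize; attempt++ )
        {
            // Clear the sign bit to get a non-negative long, then reduce to [0, highId)
            long candidate = ( random.nextLong() & Long.MAX_VALUE ) % highId;
            if ( inUse.test( candidate ) )
            {
                hits.add( candidate );
            }
        }
        return hits;
    }

    public static void main( String[] args )
    {
        // Toy id space: pretend only every third id is in use
        List<Long> ids = sample( 1000, 10, 100000, id -> id % 3 == 0, new Random() );
        System.out.println( ids );
    }
}
```

If a fraction p of the ids below highId is in use, expect roughly
SAMPLE_SIZE / p draws in total, so a store with many deleted nodes pays
for the misses as well as the hits.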
cheers,
CG
On Wed, Nov 9, 2011 at 6:08 PM, Jim Webber <[email protected]> wrote:
> Hi Anders,
>
> When you do getAllNodes, you're getting back an iterable so as you point out
> the sample isn't random (unless it was written randomly to disk). If you're
> prepared to take a scattergun approach and tolerate being disk-bound, then
> you can ask for getNodeById using a made-up ID and deal with the times when
> your ID's don't resolve.
>
> It'll be slow (since the chances of having the nodes in cache are low) but as
> random as your random ID generator.
>
> Jim
> _______________________________________________
> Neo4j mailing list
> [email protected]
> https://lists.neo4j.org/mailman/listinfo/user
>