OK, it seems to be running much faster now.
The slow performance was my fault, not Neo4j's.

Before, the input parameters to my function allowed the user to specify
exactly how many Relationships should be removed. This meant that, in order
to be "fair", I had to uniformly randomly generate IDs in the key space
(whose size is also defined by an input parameter). The problem with this
approach is that, as time passes, it becomes more and more likely to select
an ID that has already been removed. I had checks in place to avoid errors as
a result, but it still meant many wasted disk reads.

//BEFORE
until (graph_is_small_enough)
  random_number = generate_uniform(0, maxId)
  random_relationship = get_relationship_by_id(random_number)
  if (random_relationship does not exist)  // may already have been deleted
    continue
  random_relationship.delete()
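
In Java terms, the old version looked roughly like this (just a sketch
against the embedded GraphDatabaseService API; the class name, store path,
and the maxId/toDelete values are placeholders, not my actual code):

import java.util.Random;

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.NotFoundException;
import org.neo4j.graphdb.Relationship;
import org.neo4j.graphdb.Transaction;
import org.neo4j.kernel.EmbeddedGraphDatabase;

public class RandomIdDeleteSketch
{
    public static void main( String[] args )
    {
        GraphDatabaseService graphDb = new EmbeddedGraphDatabase( "path/to/twitter-db" );
        Random random = new Random();
        long maxId = 250000000L;     // size of the Relationship ID space (input parameter)
        long toDelete = 125000000L;  // how many Relationships the caller asked to remove
        long deleted = 0;

        while ( deleted < toDelete )
        {
            long randomId = (long) ( random.nextDouble() * maxId );
            Transaction tx = graphDb.beginTx();
            try
            {
                // This read is wasted whenever randomId was already deleted
                Relationship rel = graphDb.getRelationshipById( randomId );
                rel.delete();
                deleted++;
                tx.success();
            }
            catch ( NotFoundException alreadyGone )
            {
                // the "check" that avoids errors, but the disk read has already happened
            }
            finally
            {
                tx.finish();
            }
        }
        graphDb.shutdown();
    }
}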

Now the input parameters only let the user specify the PERCENTAGE of all
Relationships that should be kept.
This means I don't need to keep any state that tells me which IDs have
already been deleted, and I can iterate through all Relationships and never
have a "missed read" (assuming there are no holes in the key space).

//NOW
for (index = 0 to maxId)
  random_number = generate_uniform(0, 100)
  if (random_number < percent_to_keep)
    continue
  random_relationship = get_relationship_by_id(index)
  random_relationship.delete()
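
And the new version, again just a Java sketch (names and values are made up;
I've also shown the usual trick of batching many deletes into one transaction
rather than opening a transaction per delete, which matters for bulk writes):

import java.util.Random;

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.NotFoundException;
import org.neo4j.graphdb.Relationship;
import org.neo4j.graphdb.Transaction;
import org.neo4j.kernel.EmbeddedGraphDatabase;

public class PercentageDeleteSketch
{
    public static void main( String[] args )
    {
        GraphDatabaseService graphDb = new EmbeddedGraphDatabase( "path/to/twitter-db" );
        Random random = new Random();
        long maxId = 250000000L;      // highest Relationship ID (key space assumed to have no holes)
        double percentToKeep = 50.0;  // input parameter: percentage of Relationships to keep
        int batchSize = 10000;        // deletes per transaction

        Transaction tx = graphDb.beginTx();
        try
        {
            int inBatch = 0;
            for ( long id = 0; id <= maxId; id++ )
            {
                if ( random.nextDouble() * 100.0 < percentToKeep )
                {
                    continue; // this Relationship survives
                }
                try
                {
                    Relationship rel = graphDb.getRelationshipById( id );
                    rel.delete();
                }
                catch ( NotFoundException holeInKeySpace )
                {
                    continue; // only possible if the ID space has holes after all
                }
                if ( ++inBatch == batchSize )
                {
                    tx.success();
                    tx.finish();
                    tx = graphDb.beginTx();
                    inBatch = 0;
                }
            }
            tx.success();
        }
        finally
        {
            tx.finish();
        }
        graphDb.shutdown();
    }
}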

Performance is much better now. The first 1,000,000 deletions took ~4 minutes.

Cheers,
Alex

On Mon, May 24, 2010 at 11:24 PM, Alex Averbuch <alex.averb...@gmail.com> wrote:

> Hey,
> I have a large (by my standards) graph and I would like to reduce its size
> so it all fits in memory.
> This is the same Twitter graph I mentioned earlier: 2.5 million Nodes,
> 250 million Relationships.
>
> The goal is for the graph to still have the same topology and
> characteristics after it has been made more sparse.
> My plan to do this was to uniformly randomly select Relationships for
> deletion, until the graph is small enough.
>
> My first approach is basically this:
>
> until (graph_is_small_enough)
>   random_number = generate_uniform(0, maxId)
>   random_relationship = get_relationship_by_id(random_number)
>   random_relationship.delete()
>
> I'm using the transactional GraphDatabaseService at the moment, rather than
> the BatchInserter... mostly because I'm not inserting anything and I assumed
> the optimizations made to the BatchInserter were only for write operations.
>
> The reason I want to delete Relationships instead of Nodes is:
>   (1) I don't want to accidentally delete any "super nodes", as these are
> what give Twitter its unique structure
>   (2) The number of Nodes is not the main problem that's keeping me from
> being able to store the graph in RAM
>
> The problem with the current approach is that it feels like I'm working
> against Neo4j's strengths, and it is very, very slow... I waited over an hour
> and fewer than 1,000,000 Relationships had been deleted. Given that my aim is
> to halve the number of Relationships, it would take me over 100 hours (about
> a week) to complete this process. In the worst case this is what I'll resort
> to, but I'd rather not if there's a better way.
>
> My questions are:
> (1) Can you think of an alternative, faster, and still meaningful way (one
> that maintains the graph structure) to reduce this graph's size?
> (2) Using the same method I'm using now, are there some magical
> optimizations that will greatly improve performance?
>
> Thanks,
> Alex
>
