Hi Peter,
Yeah it's under control now.
Cheers,
Alex
On Thu, May 27, 2010 at 5:36 PM, Peter Neubauer
peter.neuba...@neotechnology.com wrote:
Alex,
not an expert on delete performance but that looks ok to me, and is
workable for you now?
Cheers,
/peter neubauer
COO and Sales, Neo Technology
GTalk: neubauer.peter
Skype peter.neubauer
Phone +46 704 106975
LinkedIn http://www.linkedin.com/in/neubauer
Twitter http://twitter.com/peterneubauer
http://www.neo4j.org - Your high performance graph database.
http://www.thoughtmade.com - Scandinavia's coolest Bring-a-Thing party.
On Tue, May 25, 2010 at 9:51 AM, Alex Averbuch alex.averb...@gmail.com
wrote:
OK, seems to be running much faster now.
The reason for such slow performance was mine, not Neo4j's.
Before the input parameters to my function allowed the user to specify
exactly how many Relationships should be removed. This meant in order to
be
fair I had to uniformly randomly generate ID's in the key space (also
defined by input parameter). The problem with this approach is as time
passes it becomes more likely to select an ID that has already been
removed.
I had checks to avoid errors as a result, but it still meant many wasted
disk reads.
//BEFORE
until (graph_is_small_enough)
random_number = generate_uniform(0, maxId)
random_relationship = get_relationship_by_id(random_number)
random_relationship.delete()
Now the input parameters only let the user specify the PERCENTAGE of all
Relationships that should be kept.
This means I don't need to keep an state that tells me which ID's have
already been deleted, and I can iterate through all Relationships and
never
have a missed read (assuming there are no wholes in the key space).
//NOW
for (index=0 to maxId)
random_number = generate_uniform(0, maxId)
if (random_number percent_to_keep)
continue;
random_relationship = get_relationship_by_id(index)
random_relationship.delete()
Performance is much better now. The first 1,000,000 deletions took
~4minutes
Cheers,
Alex
On Mon, May 24, 2010 at 11:24 PM, Alex Averbuch alex.averb...@gmail.com
wrote:
Hey,
I have a large (by my standards) graph and I would like to reduce it's
size
so it all fits in memory.
This is that same Twitter graph as I mentioned earlier: 2.5million Nodes
250million Relationships.
The goal is for the graph to still have the same topology and
characteristics after it has been made more sparse.
My plan to do this was to uniformly randomly select Relationships for
deletion, until the graph is small enough.
My first approach is basically this:
until (graph_is_small_enough)
random_relationship = get_relationship_by_id(random_number)
random_relationship.delete()
I'm using the transactional GraphDatabaseService at the moment, rather
than
the BatchInserter... mostly because I'm not inserting anything and I
assumed
the optimizations made to the BatchInserter were only for write
operations.
The reason I want to delete Relationships instead of Nodes is
(1) I don't want to accidentally delete any super nodes, as these
are
what gives Twitter it's unique structure
(2) The number of Nodes is not the main problem that's keeping me from
being able to store the graph in RAM
The problem with the current approach is that it feels like I'm working
against Neo4j's strengths and it is very very slow... I waited over an
hour
and less than 1,000,000 Relationships had been deleted. Given that my
aim is
to half the number of Relationships, it would take me over 100hours (1
week)
to complete this process. In the worst case this is what I'll resort to,
but
I'd rather not if there's a better way.
My questions are:
(1) Can you think of an alternative, faster and still meaningful
(maintain
graph structure) way to reduce this graph size?
(2) Using the same method I'm using now, are there some magical
optimizations that will greatly improve performance?
Thanks,
Alex
___
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user
___
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user
___
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user