Re: [Neo4j] How to boost performance?

Linan Wang Wed, 23 Nov 2011 07:56:42 -0800

hi,
i noticed that you are using db.getNodeById to retrieve the starting
node. the performance of this call is quite different from what
real-world apps see, especially those that key their data by an
external unique id. basically you need to index the external id
yourself and fetch the start node via an index lookup.
besides, for best performance i'd suggest not using the Traversal api
if you know exactly what you are looking for, including how deep you
are going and which relationship types you want to look at. walking
the relationships yourself could save some expensive
property/relationship retrieval.
lastly, if your db is small enough and your memory is big enough,
consider putting the full db on an in-memory fs. it'll boost the
performance 10x and save lots of time on configuration tuning.
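
a minimal sketch of the first two suggestions above, written in plain java rather than against the neo4j api so it runs on its own. everything in it is a hypothetical stand-in: TwoHopSketch, externalIdIndex, the "NEIGHBOUR" type and the "cell-a" style ids are made up for illustration. the point is only the shape of the code: resolve the start node through an index keyed by the external id, then walk exactly two hops over one known relationship type and direction.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.Set;

// Plain-Java stand-in for a graph store; none of these names are Neo4j API.
class TwoHopSketch {
    // internal node id -> relationship type -> ids of outgoing neighbours
    private final Map<Long, Map<String, List<Long>>> outgoing = new HashMap<>();
    // the index you maintain yourself: external unique id -> internal node id
    private final Map<String, Long> externalIdIndex = new HashMap<>();

    // Registers a node and indexes it under its external id.
    void addNode(long internalId, String externalId) {
        outgoing.put(internalId, new HashMap<>());
        externalIdIndex.put(externalId, internalId);
    }

    // Adds a directed, typed relationship; both nodes must already exist.
    void relate(long from, String type, long to) {
        outgoing.get(from).computeIfAbsent(type, t -> new ArrayList<>()).add(to);
    }

    private List<Long> neighbours(long id, String type) {
        return outgoing.get(id).getOrDefault(type, Collections.<Long>emptyList());
    }

    // Index lookup for the start node, then exactly two outgoing hops of one type.
    // Returns the distinct end nodes in first-seen order.
    Set<Long> twoHops(String externalId, String type) {
        Long start = externalIdIndex.get(externalId);
        if (start == null) {
            throw new NoSuchElementException("no node indexed under " + externalId);
        }
        Set<Long> ends = new LinkedHashSet<>();
        for (long second : neighbours(start, type)) {
            for (long third : neighbours(second, type)) {
                ends.add(third);
            }
        }
        return ends;
    }

    public static void main(String[] args) {
        TwoHopSketch g = new TwoHopSketch();
        g.addNode(1, "cell-a");
        g.addNode(2, "cell-b");
        g.addNode(3, "cell-c");
        g.addNode(4, "cell-d");
        g.relate(1, "NEIGHBOUR", 2);
        g.relate(1, "NEIGHBOUR", 3);
        g.relate(2, "NEIGHBOUR", 4);
        g.relate(3, "NEIGHBOUR", 4);
        g.relate(1, "OTHER", 4); // ignored: not the type we asked for
        System.out.println(g.twoHops("cell-a", "NEIGHBOUR"));
    }
}
```

node 4 is reachable over two different paths here but comes back once, because the result is a set; use a list instead if you want one entry per path.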


On Wed, Nov 23, 2011 at 3:09 PM, Vinicius Carvalho
<java.vinic...@gmail.com> wrote:
> Hi Michael, this is going to be a newbie question, so please forgive me:
>
> I've re-run the tests with your examples, using an embedded database.
> First thing: Whopping FAST! Mind blowing :D -> 5ms
>
> But ... I got different results. Same time though, which is great: it
> confirms exactly what happened on my local machine, 1k nodes 5ms, 250k
> nodes 5ms :D
>
> Using cypher on the console
> start n = node(3) match n-->()-->(x) return x
>
> I got 6475 nodes, which seems to be right, as every node has around 80
> relations, so 80*80 would give me roughly this.
>
> Using your first example (I probably got it wrong) with the new traversal:
>
> Node startNode = db.getNodeById(Long.valueOf(id));
> TraversalDescription traversalQuery =
> Traversal.description().evaluator(Evaluators.atDepth(2)).expand(Traversal.expanderForAllTypes(Direction.OUTGOING));
> long start = System.currentTimeMillis();
> for(Node n : traversalQuery.traverse(startNode).nodes()){
>   count++;
> }
> long end = System.currentTimeMillis();
> return "Fetched " + count + " nodes in " + (end-start) + " ms";
>
> It returns 196 nodes in 5ms
>
> And using the second one:
>
> Node startNode=db.getNodeById(3);
> long start = System.currentTimeMillis();
> for (Relationship rel : startNode.getRelationships()) {
>   Node other = rel.getOtherNode(startNode);
>   for(Relationship rr : other.getRelationships()){
>   count++;
>   }
> }
> long end = System.currentTimeMillis();
> return "Fetched " + count + " nodes in " + (end-start) + " ms";
>
> Returns 25896 nodes in 5ms as well.
>
> Just trying to understand why I've got different results. Again, a really
> newbie question; I'll dive into the traversal docs a bit further, but if
> you could share a thought here that would be great.
>
> Thanks
>
>
> On Wed, Nov 23, 2011 at 2:21 PM, Vinicius Carvalho
> <java.vinic...@gmail.com> wrote:
>
>> Thanks. For this test it's just a read-only graph for now, so I don't think
>> I'll run into synchronization issues. As we proceed with tests, I do hope
>> that one day we will have an HA version of neo4j and, as Jim said in that
>> thread, use it for others to read the graph.
>>
>> Regards
>>
>>
>> On Wed, Nov 23, 2011 at 2:15 PM, Michael Hunger <
>> michael.hun...@neotechnology.com> wrote:
>>
>>> Just make sure that it is just a snapshot of the data and doesn't update
>>> its caches.
>>>
>>> Otherwise you will run into synchronization issues.
>>>
>>> See also this thread and Tobias' explanations around it:
>>>
>>> http://neo4j-community-discussions.438527.n3.nabble.com/Neo4j-Synchronization-of-EmbeddedReadOnlyGraphDatabase-Bug-td3174626.html#a3213450
>>>
>>> Michael
>>>
>>> Am 23.11.2011 um 15:05 schrieb Vinicius Carvalho:
>>>
>>> > But wouldn't it mean that I need to have an exclusive lock on the db?
>>> > I would like to keep the server running pointing at the same data
>>> > directory.
>>> >
>>> > Regards
>>> >
>>> > On Wed, Nov 23, 2011 at 1:50 PM, Michael Hunger <
>>> > michael.hun...@neotechnology.com> wrote:
>>> >
>>> >> Please use EmbeddedGraphDatabase,
>>> >>
>>> >> EmbeddedReadOnlyGraphDatabase caches a snapshot of the data in its
>>> >> caches and doesn't get update-changes.
>>> >>
>>> >> Michael
>>> >>
>>> >> Am 23.11.2011 um 14:39 schrieb Vinicius Carvalho:
>>> >>
>>> >>> Hi Michael, thanks. The data load was fine, I've used your script with
>>> >>> the BatchInserter. Memory footprint was really low, I think the peak
>>> >>> was 200mb of heap usage. I did something really silly and left a
>>> >>> logger.info in, which slowed things a bit, but the process was really
>>> >>> smooth.
>>> >>>
>>> >>> Many thanks for the help with the query. I'll try this, I'm putting the
>>> >>> read-only embedded neo inside our app right now. I expect to see a
>>> >>> good performance boost :)
>>> >>>
>>> >>> Best Regards
>>> >>>
>>> >>> On Wed, Nov 23, 2011 at 12:12 PM, Michael Hunger <
>>> >>> michael.hun...@neotechnology.com> wrote:
>>> >>>
>>> >>>> Vinicius,
>>> >>>>
>>> >>>> first: did you have any issues importing the data into Neo4j?
>>> >>>> second: your example used cypher which is not optimized for
>>> >>>> performance (yet!). This is in our plans for the next two releases
>>> >>>> of neo4j.
>>> >>>>
>>> >>>> So if you want to see the real performance of neo4j, please use the
>>> >>>> traversal framework or the core-API:
>>> >>>>
>>> >>>> Cypher & Traversals:
>>> >>>>
>>> >>>> // define
>>> >>>> cypherQuery = cypherParser.parse("start n=node({start_node}) match n-->()-->x return x")
>>> >>>> traversalQuery = Traversal.description().evaluator(Evaluators.atDepth(2)).expand(Traversal.expanderForAllTypes(Direction.OUTGOING))
>>> >>>>
>>> >>>> // execute
>>> >>>> for (Node n : cypherQuery.execute({"start_node":startNode})) { ... }
>>> >>>> for (Node n : traversalQuery.traverse(startNode).nodes()) { ... }
>>> >>>>
>>> >>>> If you're interested in the paths, remove the ".nodes()" call at the
>>> >>>> traverser
>>> >>>>
>>> >>>> In java core-api code:
>>> >>>>
>>> >>>> Node start = db.getNodeById(3);
>>> >>>>
>>> >>>> // note: getRelationships() without a Direction returns both incoming
>>> >>>> // and outgoing relationships
>>> >>>> for (Relationship first : start.getRelationships()) {
>>> >>>>  Node second = first.getOtherNode(start);
>>> >>>>  for (Relationship next : second.getRelationships()) {
>>> >>>>      Node third = next.getOtherNode(second);
>>> >>>>      // do something with the 3 nodes, 2 relationships which form your path
>>> >>>>  }
>>> >>>> }
>>> >>>>
>>> >>>> In the REST API the traversal would look like: (see
>>> >>>> http://docs.neo4j.org/chunked/snapshot/rest-api-traverse.html#rest-api-traversal-using-a-return-filter
>>> >>>> )
>>> >>>>  * POST http://localhost:7474/db/data/node/3/traverse/node
>>> >>>>  * Accept: application/json
>>> >>>>  * Content-Type: application/json
>>> >>>>
>>> >>>> {
>>> >>>> "relationships" : [ {"direction" : "out" } ],
>>> >>>> "max_depth" : 3
>>> >>>> }
>>> >>>>
>>> >>>>
>>> >>>> Am 23.11.2011 um 11:54 schrieb Vinicius Carvalho:
>>> >>>>
>>> >>>>> Hi there, I posted a few days ago about the POC I'm doing here at
>>> >>>>> my company. I have some initial numbers and I'd like to ask for some
>>> >>>>> help here in order to promote neo4j here in LMI Ericsson.
>>> >>>>>
>>> >>>>> I've loaded a MySQL db with a really simple entity, that pretty much
>>> >>>>> only represents a node and relations (the only properties it has are
>>> >>>>> a UID and an x/y space coordinate for each node).
>>> >>>>>
>>> >>>>> The DB contains 250.000 cells and 19. relations stored in a MyISAM
>>> >>>>> table, indexed only by its primary key. Please find the DDL for the
>>> >>>>> two tables.
>>> >>>>>
>>> >>>>> CREATE TABLE  `pci`.`cells` (
>>> >>>>> `id` varchar(32) collate utf8_bin NOT NULL,
>>> >>>>> `x_pos` double default NULL,
>>> >>>>> `y_pos` double default NULL,
>>> >>>>> `pci` smallint(6) default '0',
>>> >>>>> PRIMARY KEY  (`id`)
>>> >>>>> )
>>> >>>>>
>>> >>>>> CREATE TABLE  `pci`.`relations` (
>>> >>>>> `id` int(11) NOT NULL auto_increment,
>>> >>>>> `source` varchar(32) collate utf8_bin default NULL,
>>> >>>>> `target` varchar(32) collate utf8_bin default NULL,
>>> >>>>> PRIMARY KEY  (`id`),
>>> >>>>> KEY `src_idx` (`source`),
>>> >>>>> KEY `src_target` (`target`)
>>> >>>>> )
>>> >>>>>
>>> >>>>> So as you can see, a simple secondary table contains the relationship,
>>> >>>>> with source and target pointing to the cells table.
>>> >>>>>
>>> >>>>> I've loaded this exact same DB into a neoserver running on the same
>>> >>>>> machine: A Blade with 26 cpus (6 cores each) and 16gb RAM.
>>> >>>>>
>>> >>>>> One of the requirements we have is to find all associations of my
>>> >>>>> associations. Something that in neo I did like this:
>>> >>>>>
>>> >>>>> START n = node(3)
>>> >>>>> MATCH n-->()-->(x)
>>> >>>>> return x
>>> >>>>>
>>> >>>>> For this specific node it returns 6475 nodes.
>>> >>>>>
>>> >>>>> I have tested this before using Hibernate in two modes: without an L2
>>> >>>>> cache, and with an L2 cache (Ehcache standalone, no replication).
>>> >>>>> Here's a snippet of the code that loads it, so you can understand
>>> >>>>> what's going on under the hood:
>>> >>>>>
>>> >>>>>
>>> >>>>> @Override
>>> >>>>> public List<Cell> loadCellWithRealtions(String... ids) {
>>> >>>>>     Session session = (Session) em.getDelegate();
>>> >>>>>     Criteria c = session.createCriteria(Cell.class)
>>> >>>>>             .setFetchMode("incomingRelations", FetchMode.SELECT)
>>> >>>>>             .setFetchMode("outgoingRelations", FetchMode.SELECT)
>>> >>>>>             .add(Restrictions.in("id", Arrays.asList(ids)));
>>> >>>>>     List<Cell> results = c.list();
>>> >>>>>     for (Cell cell : results) {
>>> >>>>>         Hibernate.initialize(cell.getIncomingRelations());
>>> >>>>>         Hibernate.initialize(cell.getOutgoingRelations());
>>> >>>>>     }
>>> >>>>>     return results;
>>> >>>>> }
>>> >>>>>
>>> >>>>> @Override
>>> >>>>> public List<Cell> loadCellWithNeighbourRelations(String... ids) {
>>> >>>>>     List<Cell> cells = loadCellWithRealtions(ids);
>>> >>>>>     for (Cell c : cells) {
>>> >>>>>         for (Relation r : c.getIncomingRelations()) {
>>> >>>>>             Hibernate.initialize(r.getSource().getIncomingRelations());
>>> >>>>>             Hibernate.initialize(r.getSource().getOutgoingRelations());
>>> >>>>>         }
>>> >>>>>         for (Relation r : c.getOutgoingRelations()) {
>>> >>>>>             Hibernate.initialize(r.getTarget().getIncomingRelations());
>>> >>>>>             Hibernate.initialize(r.getTarget().getOutgoingRelations());
>>> >>>>>         }
>>> >>>>>     }
>>> >>>>>     return cells;
>>> >>>>> }
>>> >>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>> So the first method executes one query and 2 subselects to find a cell
>>> >>>>> and all its relations; the second method iterates over each relation
>>> >>>>> and does the same. So I pretty much will have something like 3+r*3
>>> >>>>> selects on the db, where r is the number of relations, right?
>>> >>>>>
>>> >>>>> Ok, to be a bit fair with the tests, I ran this for the same node 10
>>> >>>>> times (to get a chance to warm the caches), excluded the longest and
>>> >>>>> shortest results, and then took the mean of the rest. Here are the
>>> >>>>> results:
>>> >>>>>
>>> >>>>> EhCache: 70ms
>>> >>>>> Plain Hibernate: 550ms
>>> >>>>>
>>> >>>>> I still don't have a version of the neo4j code running integrated in
>>> >>>>> the app server, but the idea is to use the REST API. Running the query
>>> >>>>> on the REST API took over 2 seconds on average, but due to the large
>>> >>>>> size of the response, network lag was the issue. So I ran the same
>>> >>>>> query 10 times using the web console, and the average time for neo
>>> >>>>> was 300ms.
>>> >>>>>
>>> >>>>> Before asking anything: I do know that we will have more complex
>>> >>>>> queries where neo will shine, but I need to improve those results in
>>> >>>>> order to sell it here :). With those numbers, ppl will just say that
>>> >>>>> having a cache and using a relational model would suffice.
>>> >>>>>
>>> >>>>> Anything I could do to improve this?
>>> >>>>>
>>> >>>>> Regards
>>> >>>>> _______________________________________________
>>> >>>>> Neo4j mailing list
>>> >>>>> User@lists.neo4j.org
>>> >>>>> https://lists.neo4j.org/mailman/listinfo/user

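on the three different counts in the quoted message (6475, 196 and 25896): a likely explanation, as far as i understand the apis, is that the three snippets count different things. the cypher pattern returns one row per outgoing two-step path; the traversal framework, with what i believe is its default node-global uniqueness, reports each node at most once; and getRelationships() without a Direction returns incoming as well as outgoing relationships, with the nested loop counting relationships rather than nodes. the sketch below is plain java on a toy graph (CountingSketch and its methods are hypothetical names, not neo4j api). it only models those three counting rules and does not reproduce the traversal framework's actual visiting order.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy directed graph used to compare three ways of "counting two hops".
class CountingSketch {
    private final Map<Integer, List<Integer>> out = new HashMap<>();
    private final Map<Integer, List<Integer>> in = new HashMap<>();

    void relate(int from, int to) {
        out.computeIfAbsent(from, k -> new ArrayList<>()).add(to);
        in.computeIfAbsent(to, k -> new ArrayList<>()).add(from);
    }

    private List<Integer> outOf(int n) {
        return out.getOrDefault(n, Collections.<Integer>emptyList());
    }

    private List<Integer> inTo(int n) {
        return in.getOrDefault(n, Collections.<Integer>emptyList());
    }

    // Rule 1: one result per outgoing two-step path (like n-->()-->(x)).
    int outgoingPaths(int start) {
        int count = 0;
        for (int mid : outOf(start)) {
            count += outOf(mid).size();
        }
        return count;
    }

    // Rule 2: every node is visited at most once; count nodes first seen at depth 2.
    int distinctAtDepthTwo(int start) {
        Set<Integer> seen = new HashSet<>();
        seen.add(start);
        List<Integer> depthOne = new ArrayList<>();
        for (int mid : outOf(start)) {
            if (seen.add(mid)) {
                depthOne.add(mid);
            }
        }
        int count = 0;
        for (int mid : depthOne) {
            for (int end : outOf(mid)) {
                if (seen.add(end)) {
                    count++;
                }
            }
        }
        return count;
    }

    // Rule 3: nested loops over relationships in both directions,
    // counting second-level relationships rather than nodes.
    int relationshipsBothDirections(int start) {
        List<Integer> firstHop = new ArrayList<>(outOf(start));
        firstHop.addAll(inTo(start));
        int count = 0;
        for (int other : firstHop) {
            count += outOf(other).size() + inTo(other).size();
        }
        return count;
    }

    public static void main(String[] args) {
        CountingSketch g = new CountingSketch();
        g.relate(1, 2);
        g.relate(1, 3);
        g.relate(2, 4);
        g.relate(3, 4);
        g.relate(2, 3);
        g.relate(4, 1);
        System.out.println(g.outgoingPaths(1));
        System.out.println(g.distinctAtDepthTwo(1));
        System.out.println(g.relationshipsBothDirections(1));
    }
}
```

on this six-relationship toy graph the three rules give 3, 1 and 9 for the same start node, which is the same ordering as the counts in the quoted message: distinct nodes fewest, outgoing paths in between, both-direction relationship counts highest.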


-- 
Best wishes,

Linan Wang
_______________________________________________
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user
