On 9 Jun 2011, at 22:12, Michael Hunger wrote:

> Please keep in mind that the HashMap of 10M strings -> longs will take a 
> substantial amount of heap memory.
> That's not the fault of Neo4j :) On my system it alone takes 1.8 G of memory 
> (distributed across the strings, the hashmap-entries and the longs).
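> (As a rough back-of-envelope check on that 1.8 G figure, assuming a 64-bit JVM
> and typical object overheads -- the per-object sizes below are estimates, not
> measurements:
>
>     String of ~7 chars (header + char[] + fields)   ~  64 bytes
>     boxed Long                                      ~  24 bytes
>     HashMap.Entry + its slot in the table           ~  48 bytes
>     total                                           ~ 136 bytes per mapping
>
>     10,000,000 mappings x ~136-180 bytes  =  roughly 1.4-1.8 GB.)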


Fair enough, but removing the Map, using the Index instead, and setting 
cache_type to weak makes almost no difference to the program's behaviour: it 
still progressively consumes the heap until it fails. I did exactly that, 
including removing the allocation of the Map, and watched the heap consumption 
follow a similar pattern until it failed, as shown below.

> Or you could perhaps use an Amazon EC2 instance, which you can easily get 
> with up to 68 GB of RAM :)

With respect, and while I notice the smile, throwing memory at the problem is 
not an option for the large set of enterprise applications that might actually 
be willing to pay to use Neo4j if it didn't fall at the first hurdle when 
confronted with a trivial, small-scale data load...

runImport failed after 2,072 seconds....
 
Creating data took 316 seconds
Physical mem: 1535MB, Heap size: 1016MB
use_memory_mapped_buffers=false
neostore.propertystore.db.index.keys.mapped_memory=1M
neostore.propertystore.db.strings.mapped_memory=52M
neostore.propertystore.db.arrays.mapped_memory=60M
neo_store=N:\TradeModel\target\hepper\neostore
neostore.relationshipstore.db.mapped_memory=76M
neostore.propertystore.db.index.mapped_memory=1M
neostore.propertystore.db.mapped_memory=62M
dump_configuration=true
cache_type=weak
neostore.nodestore.db.mapped_memory=17M
1000000 nodes created. Took 59906
2000000 nodes created. Took 64546
3000000 nodes created. Took 74577
4000000 nodes created. Took 82607
5000000 nodes created. Took 171091
Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: Java heap space
        at java.io.BufferedOutputStream.<init>(Unknown Source)
        at java.io.BufferedOutputStream.<init>(Unknown Source)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(Unknown Source)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)
Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: Java heap space
        at java.io.BufferedInputStream.<init>(Unknown Source)
        at java.io.BufferedInputStream.<init>(Unknown Source)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(Unknown Source)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)
Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: Java heap space
        at java.io.BufferedOutputStream.<init>(Unknown Source)
        at java.io.BufferedOutputStream.<init>(Unknown Source)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(Unknown Source)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)
Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: Java heap space
        at java.io.BufferedInputStream.<init>(Unknown Source)
        at java.io.BufferedInputStream.<init>(Unknown Source)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(Unknown Source)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)
 
 
 

> So 3 GB of heap is sensible for running this; that leaves about 1 GB for 
> Neo4j and its caches.
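> (In concrete terms that means starting the JVM that runs the import with
> roughly the arguments below; for a Maven surefire run they would go in its
> argLine parameter, in an IDE in the launch configuration's VM arguments:
>
>     -Xms3g -Xmx3g )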
> 
> Of course you're free to shard your map (e.g. by the first letter of the 
> name) and persist those maps to disk, reloading them as needed. But that's an 
> application-level concern.
> If you really are that memory-constrained, you should try Chris Gioran's 
> implementation, which will take care of that. Or you could perhaps use an 
> Amazon EC2 instance, which you can easily get with up to 68 GB of RAM :)
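
A minimal sketch of that sharding idea, assuming plain java.io serialization; 
the ShardedNameMap helper is hypothetical, not part of Neo4j:

import java.io.*;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: shard a name->node-id map by the first letter of the
 *  name, so individual shards can be written to disk and reloaded on demand. */
public class ShardedNameMap {
    private final Map<Character, Map<String, Long>> shards =
            new HashMap<Character, Map<String, Long>>();
    private final File dir;

    public ShardedNameMap(File dir) {
        this.dir = dir;
        dir.mkdirs();
    }

    public void put(String name, long nodeId) {
        Character key = Character.valueOf(name.charAt(0));
        Map<String, Long> shard = shards.get(key);
        if (shard == null) {
            shard = new HashMap<String, Long>();
            shards.put(key, shard);
        }
        shard.put(name, nodeId);
    }

    public Long get(String name) {
        Map<String, Long> shard = shards.get(Character.valueOf(name.charAt(0)));
        return shard == null ? null : shard.get(name);
    }

    /** Serialize one shard to disk and drop it from the heap. */
    public void evict(char first) throws IOException {
        Map<String, Long> shard = shards.remove(Character.valueOf(first));
        if (shard == null) return;
        ObjectOutputStream out = new ObjectOutputStream(new BufferedOutputStream(
                new FileOutputStream(new File(dir, "shard-" + first))));
        try {
            out.writeObject(shard);
        } finally {
            out.close();
        }
    }

    /** Reload a previously evicted shard. */
    @SuppressWarnings("unchecked")
    public void load(char first) throws IOException {
        ObjectInputStream in = new ObjectInputStream(new BufferedInputStream(
                new FileInputStream(new File(dir, "shard-" + first))));
        try {
            shards.put(Character.valueOf(first), (Map<String, Long>) in.readObject());
        } catch (ClassNotFoundException e) {
            throw new IOException(e);
        } finally {
            in.close();
        }
    }
}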
> 
> Cheers
> 
> Michael
> 
> 
> P.S. As a side note, regarding the rest of the memory:
> have you tried the weak reference cache instead of the default soft one?
> In your config.properties add
> cache_type = weak
> That should take care of your memory problems (and of the stalls, which are 
> actually the GC trying to reclaim memory).
> 
> On 9 Jun 2011, at 22:36, Paul Bandler wrote:
> 
>> I ran Michael's example import program, with the Map replacing the index, on 
>> my more modestly configured machine, to see whether the import scaling 
>> problems I reported previously with the BatchInserter were reproduced. They 
>> were. I gave the program 1G of heap and watched it run using jconsole. It 
>> ran reasonably quickly, consuming the heap in an almost straight line, until 
>> it neared capacity; then it practically stopped for about 20 minutes, after 
>> which it died with an out-of-memory error; see below.
>> 
>> Now I'm not saying that Neo4j should necessarily go out of its way to 
>> support very memory-constrained environments, but I don't think it is 
>> unreasonable to expect its batch import mechanism not to fall over in this 
>> way; it should flush its buffers (or whatever else is needed) itself, rather 
>> than requiring the import application writer to shut it down and restart it 
>> periodically (a sketch of that workaround follows)...
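
For concreteness, the periodic shutdown/restart workaround referred to above 
looks roughly like this; a minimal sketch against the BatchInserter API used 
in the code below, where BATCH_SIZE and the surrounding plumbing are 
assumptions to be tuned:

import java.io.*;
import java.util.Map;

import org.neo4j.helpers.collection.MapUtil;
import org.neo4j.kernel.impl.batchinsert.BatchInserter;
import org.neo4j.kernel.impl.batchinsert.BatchInserterImpl;

public class RestartingNodeImport {
    private static final int BATCH_SIZE = 1000000; // assumed; tune to the heap

    static void importNodes(File storeDir, BufferedReader reader,
                            Map<String, Long> cache) throws IOException {
        BatchInserter inserter = new BatchInserterImpl(storeDir.getAbsolutePath());
        int nodes = 0;
        String line;
        while ((line = reader.readLine()) != null) {
            String name = line.split("\\|")[0];
            cache.put(name, inserter.createNode(MapUtil.map("name", name)));
            if (++nodes % BATCH_SIZE == 0) {
                // shutdown() flushes everything to disk and releases the
                // inserter's internal caches; then simply re-open the store.
                inserter.shutdown();
                inserter = new BatchInserterImpl(storeDir.getAbsolutePath());
            }
        }
        inserter.shutdown();
    }
}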
>> 
>> Creating data took 331 seconds
>> 1000000 nodes created. Took 29001
>> 2000000 nodes created. Took 35107
>> 3000000 nodes created. Took 35904
>> 4000000 nodes created. Took 66169
>> 5000000 nodes created. Took 63280
>> 6000000 nodes created. Took 183922
>> 7000000 nodes created. Took 258276
>> 
>> com.nomura.smo.rdm.neo4j.restore.Hepper
>> createData (330.364 seconds)
>> runImport (1,485 seconds later...)
>> java.lang.OutOfMemoryError: Java heap space
>>       at java.util.ArrayList.<init>(Unknown Source)
>>       at java.util.ArrayList.<init>(Unknown Source)
>>       at org.neo4j.kernel.impl.nioneo.store.PropertyRecord.<init>(PropertyRecord.java:33)
>>       at org.neo4j.kernel.impl.batchinsert.BatchInserterImpl.createPropertyChain(BatchInserterImpl.java:425)
>>       at org.neo4j.kernel.impl.batchinsert.BatchInserterImpl.createNode(BatchInserterImpl.java:143)
>>       at com.nomura.smo.rdm.neo4j.restore.Hepper.runImport(Hepper.java:61)
>>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>       at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
>>       at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>>       at java.lang.reflect.Method.invoke(Unknown Source)
>>       at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
>>       at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
>>       at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
>>       at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
>>       at org.junit.runners.BlockJUnit4ClassRunner.runNotIgnored(BlockJUnit4ClassRunner.java:79)
>>       at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:71)
>>       at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:49)
>>       at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
>>       at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52)
>>       at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191)
>>       at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42)
>>       at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184)
>>       at org.junit.runners.ParentRunner.run(ParentRunner.java:236)
>>       at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:49)
>>       at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
>>       at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
>>       at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
>>       at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
>>       at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)
>> 
>> 
>> Regards,
>> Paul Bandler 
>> On 9 Jun 2011, at 12:27, Michael Hunger wrote:
>> 
>>> I recreated Daniel's code in Java, mainly because some things were missing 
>>> from his Scala example.
>>> 
>>> You're right that the index is the bottleneck. But with your small data 
>>> set it should be possible to cache all 10m nodes in a heap that fits on 
>>> your machine.
>>> 
>>> I ran it first with the index and got about 8 seconds per 1M nodes and 
>>> 320 seconds per 1M relationships.
>>> 
>>> Then I switched to a 3G heap and a HashMap for the name=>node lookup, and 
>>> it went to 2 seconds per 1M nodes, and from 13 down to 3 seconds per 1M 
>>> relationships.
>>> 
>>> That is the approach Chris takes, except that his solution can persist the 
>>> map to disk and is more efficient :)
>>> 
>>> Hope that helps.
>>> 
>>> Michael
>>> 
>>> package org.neo4j.load;
>>> 
>>> import org.apache.commons.io.FileUtils;
>>> import org.junit.Test;
>>> import org.neo4j.graphdb.RelationshipType;
>>> import org.neo4j.graphdb.index.BatchInserterIndex;
>>> import org.neo4j.graphdb.index.BatchInserterIndexProvider;
>>> import org.neo4j.helpers.collection.MapUtil;
>>> import org.neo4j.index.impl.lucene.LuceneBatchInserterIndexProvider;
>>> import org.neo4j.kernel.impl.batchinsert.BatchInserter;
>>> import org.neo4j.kernel.impl.batchinsert.BatchInserterImpl;
>>> 
>>> import java.io.*;
>>> import java.util.HashMap;
>>> import java.util.Map;
>>> import java.util.Random;
>>> 
>>> /**
>>> * @author mh
>>> * @since 09.06.11
>>> */
>>> public class Hepper {
>>> 
>>>  public static final int MILLION = 1000000;
>>>  public static final int REPORT_COUNT = MILLION;
>>> 
>>>  enum MyRelationshipTypes implements RelationshipType {
>>>      BELONGS_TO
>>>  }
>>> 
>>>  public static final int COUNT = MILLION * 10;
>>> 
>>>  @Test
>>>  public void createData() throws IOException {
>>>      long time = System.currentTimeMillis();
>>>      final PrintWriter writer = new PrintWriter(new BufferedWriter(new 
>>> FileWriter("data.txt")));
>>>      Random r = new Random(-1L);
>>>      for (int nodes = 0; nodes < COUNT; nodes++) {
>>>          writer.printf("%07d|%07d|%07d%n", nodes, r.nextInt(COUNT), 
>>> r.nextInt(COUNT));
>>>      }
>>>      writer.close();
>>>      System.out.println("Creating data took "+ (System.currentTimeMillis() 
>>> - time) / 1000 +" seconds");
>>>  }
>>> 
>>>  @Test
>>>  public void runImport() throws IOException {
>>>      Map<String, Long> cache = new HashMap<String, Long>(COUNT);
>>>      final File storeDir = new File("target/hepper");
>>>      FileUtils.deleteDirectory(storeDir);
>>>      BatchInserter inserter = new BatchInserterImpl(storeDir.getAbsolutePath());
>>>      final BatchInserterIndexProvider indexProvider = new LuceneBatchInserterIndexProvider(inserter);
>>>      final BatchInserterIndex index = indexProvider.nodeIndex("pages", MapUtil.stringMap("type", "exact"));
>>>      BufferedReader reader = new BufferedReader(new FileReader("data.txt"));
>>>      String line = null;
>>>      int nodes = 0;
>>>      long time = System.currentTimeMillis();
>>>      long batchTime = time;
>>>      while ((line = reader.readLine()) != null) {
>>>          final String[] nodeNames = line.split("\\|");
>>>          final String name = nodeNames[0];
>>>          final Map<String, Object> props = MapUtil.map("name", name);
>>>          final long node = inserter.createNode(props);
>>>          //index.add(node, props);
>>>          cache.put(name, node);
>>>          nodes++;
>>>          if ((nodes % REPORT_COUNT) == 0) {
>>>              System.out.printf("%d nodes created. Took %d %n", nodes, (System.currentTimeMillis() - batchTime));
>>>              batchTime = System.currentTimeMillis();
>>>          }
>>>      }
>>> 
>>>      System.out.println("Creating nodes took " + (System.currentTimeMillis() - time) / 1000);
>>>      index.flush();
>>>      reader.close();
>>>      reader = new BufferedReader(new FileReader("data.txt"));
>>>      int rels = 0;
>>>      time = System.currentTimeMillis();
>>>      batchTime = time;
>>>      while ((line = reader.readLine()) != null) {
>>>          final String[] nodeNames = line.split("\\|");
>>>          final String name = nodeNames[0];
>>>          //final Long from = index.get("name", name).getSingle();
>>>          final Long from = cache.get(name);
>>>          for (int j = 1; j < nodeNames.length; j++) {
>>>              //final Long to = index.get("name", nodeNames[j]).getSingle();
>>>              final Long to = cache.get(nodeNames[j]); // look up the target node, not the source
>>>              inserter.createRelationship(from, to, MyRelationshipTypes.BELONGS_TO, null);
>>>          }
>>>          rels++;
>>>          if ((rels % REPORT_COUNT) == 0) {
>>>              System.out.printf("%d relationships created. Took %d %n", rels, (System.currentTimeMillis() - batchTime));
>>>              batchTime = System.currentTimeMillis();
>>>          }
>>>      }
>>>      System.out.println("Creating relationships took " + (System.currentTimeMillis() - time) / 1000);
>>>      // shut down to flush everything to the store
>>>      indexProvider.shutdown();
>>>      inserter.shutdown();
>>>  }
>>> }
>>> 
>>> 
>>> 1000000 nodes created. Took 2227 
>>> 2000000 nodes created. Took 1930 
>>> 3000000 nodes created. Took 1818 
>>> 4000000 nodes created. Took 1966 
>>> 5000000 nodes created. Took 1857 
>>> 6000000 nodes created. Took 2009 
>>> 7000000 nodes created. Took 2068 
>>> 8000000 nodes created. Took 1991 
>>> 9000000 nodes created. Took 2151 
>>> 10000000 nodes created. Took 2276 
>>> Creating nodes took 20
>>> 1000000 relationships created. Took 13441 
>>> 2000000 relationships created. Took 12887 
>>> 3000000 relationships created. Took 12922 
>>> 4000000 relationships created. Took 13149 
>>> 5000000 relationships created. Took 14177 
>>> 6000000 relationships created. Took 3377 
>>> 7000000 relationships created. Took 2932 
>>> 8000000 relationships created. Took 2991 
>>> 9000000 relationships created. Took 2992 
>>> 10000000 relationships created. Took 2912 
>>> Creating relationships took 81
>>> 
>>> On 9 Jun 2011, at 12:51, Chris Gioran wrote:
>>> 
>>>> Hi Daniel,
>>>> 
>>>> I am currently working on a tool for importing big data sets into Neo4j
>>>> graphs.
>>>> The main problem in such operations is that the usual index
>>>> implementations are just too slow at retrieving the mapping from keys to
>>>> created node ids, so a custom solution is needed, one that depends to a
>>>> varying degree on the distribution of values in the input set.
>>>> 
>>>> While your dataset is smaller than the data sizes I deal with, I would
>>>> like to use it as a test case. If you could somehow provide the actual
>>>> data, or something that emulates it, I would be grateful.
>>>> 
>>>> If you want to see my approach, it is available here
>>>> 
>>>> https://github.com/digitalstain/BigDataImport
>>>> 
>>>> The core algorithm is an XJoin-style two-level hashing scheme with
>>>> adaptable eviction strategies, but it is not production-ready yet,
>>>> mainly from an API perspective.
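
For readers who only want the shape of the idea, here is a generic, much 
simplified illustration of two-level hashing with a pluggable eviction 
strategy. It is emphatically not the code from the repository above, and the 
spill-to-disk step is left as a comment:

import java.util.HashMap;
import java.util.Map;

/** Level one hashes a key to a partition; level two is an ordinary map inside
 *  the partition; an EvictionStrategy picks the partition to spill to disk
 *  when the in-memory budget is exceeded. */
public class TwoLevelMap {
    interface EvictionStrategy {
        int chooseVictim(Map<String, Long>[] partitions);
    }

    /** One possible strategy: always spill the largest resident partition. */
    static class LargestFirst implements EvictionStrategy {
        public int chooseVictim(Map<String, Long>[] partitions) {
            int victim = 0;
            for (int i = 1; i < partitions.length; i++) {
                if (partitions[i] != null && (partitions[victim] == null
                        || partitions[i].size() > partitions[victim].size())) {
                    victim = i;
                }
            }
            return victim;
        }
    }

    private final Map<String, Long>[] partitions;
    private final EvictionStrategy strategy;

    @SuppressWarnings("unchecked")
    public TwoLevelMap(int fanOut, EvictionStrategy strategy) {
        this.partitions = new HashMap[fanOut];
        this.strategy = strategy;
    }

    public void put(String key, long value) {
        int p = (key.hashCode() & 0x7fffffff) % partitions.length; // level one
        if (partitions[p] == null) {
            partitions[p] = new HashMap<String, Long>();           // level two
        }
        partitions[p].put(key, value);
        // When over budget: spill partitions[strategy.chooseVictim(partitions)]
        // to disk, null the slot, and reload it lazily on the next access.
    }

    public Long get(String key) {
        int p = (key.hashCode() & 0x7fffffff) % partitions.length;
        return partitions[p] == null ? null : partitions[p].get(key);
    }
}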
>>>> 
>>>> You can contact me directly for any details regarding this issue.
>>>> 
>>>> cheers,
>>>> CG
>>>> 
>>>> On Thu, Jun 9, 2011 at 12:59 PM, Daniel Hepper <[email protected]> 
>>>> wrote:
>>>>> Hi all,
>>>>> 
>>>>> I'm struggling with importing a graph with about 10m nodes and 20m
>>>>> relationships, where each node has 0 to 10 relationships. Creating the
>>>>> nodes takes about 10 minutes, but creating the relationships is slower
>>>>> by several orders of magnitude. I'm using a 2.4 GHz i7 MacBook Pro with
>>>>> 4 GB RAM and a conventional HDD.
>>>>> 
>>>>> The graph is stored as adjacency list in a text file where each line
>>>>> has this form:
>>>>> 
>>>>> Foo|Bar|Baz
>>>>> (Node Foo has relations to Bar and Baz)
>>>>> 
>>>>> My current approach is to iterate over the whole file twice. In the
>>>>> first run, I create a node with the property "name" for the first
>>>>> entry in the line (Foo in this case) and add it to an index.
>>>>> In the second run, I get the start node and the end nodes from the
>>>>> index by name and create the relationships.
>>>>> 
>>>>> My code can be found here: http://pastie.org/2041801
>>>>> 
>>>>> With my approach, the best I can achieve is 100 created relationships
>>>>> per second.
>>>>> I experimented with mapped memory settings, but without much effect.
>>>>> Is this the speed I can expect?
>>>>> Any advice on how to speed up this process?
>>>>> 
>>>>> Best regards,
>>>>> Daniel Hepper

_______________________________________________
Neo4j mailing list
[email protected]
https://lists.neo4j.org/mailman/listinfo/user
