Re: [Neo4j] Speeding up initial import of graph
On 9 Jun 2011, at 22:12, Michael Hunger wrote:

> Please keep in mind that the HashMap of 10M strings -> longs will take a substantial amount of heap memory. That's not the fault of Neo4j :) On my system it alone takes 1.8 G of memory (distributed across the strings, the hashmap entries and the longs).

Fair enough, but removing the Map, using the Index instead and setting cache_type to weak makes almost no difference to the program's behaviour: it still progressively consumes the heap until it fails. I did this, including removing the allocation of the Map, and watched the heap consumption follow a similar pattern until it failed as below.

> Or you should perhaps use an amazon ec2 instance which you can easily get with up to 68 G of RAM :)

With respect, and while I notice the smile, throwing memory at it is not an option for a large set of enterprise applications that might actually be willing to pay to use Neo4j, if only it didn't fail at the first hurdle when confronted with a trivial and small-scale data load...

runImport failed after 2,072 seconds
Creating data took 316 seconds
Physical mem: 1535MB, Heap size: 1016MB
use_memory_mapped_buffers=false
neostore.propertystore.db.index.keys.mapped_memory=1M
neostore.propertystore.db.strings.mapped_memory=52M
neostore.propertystore.db.arrays.mapped_memory=60M
neo_store=N:\TradeModel\target\hepper\neostore
neostore.relationshipstore.db.mapped_memory=76M
neostore.propertystore.db.index.mapped_memory=1M
neostore.propertystore.db.mapped_memory=62M
dump_configuration=true
cache_type=weak
neostore.nodestore.db.mapped_memory=17M
100 nodes created. Took 59906
200 nodes created. Took 64546
300 nodes created. Took 74577
400 nodes created. Took 82607
500 nodes created. Took 171091

Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: Java heap space
	at java.io.BufferedOutputStream.<init>(Unknown Source)
	at java.io.BufferedOutputStream.<init>(Unknown Source)
	at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(Unknown Source)
	at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(Unknown Source)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.lang.Thread.run(Unknown Source)
[The same OutOfMemoryError trace appeared three more times, alternating between BufferedOutputStream and BufferedInputStream.]

> So 3 GB of heap are sensible to run this; that leaves about 1 G for neo4j + its caches. Of course you're free to shard your map (e.g. by first letter of the name) and persist those maps to disk and reload them if needed. But that's an application-level concern. If you are really limited that way wrt memory, you should try Chris Gioran's implementation, which will take care of that. Or you should perhaps use an amazon ec2 instance which you can easily get with up to 68 G of RAM :)
>
> Cheers
> Michael
>
> P.S. As a side-note, for the rest of the memory: have you tried the weak reference cache instead of the default soft one? In your config.properties add cache_type = weak; that should take care of your memory problems (and the stopping, which is actually the GC trying to reclaim memory).
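For reference, the settings dumped above come from the Neo4j configuration passed to the BatchInserter. A configuration of the same shape, shifting more mapped memory toward the relationship and property stores for an import, might look like the fragment below; the values are illustrative assumptions, not tuned recommendations, and together with the JVM heap they must still fit in physical RAM:

```properties
use_memory_mapped_buffers=true
cache_type=weak
dump_configuration=true
neostore.nodestore.db.mapped_memory=90M
neostore.relationshipstore.db.mapped_memory=500M
neostore.propertystore.db.mapped_memory=200M
neostore.propertystore.db.strings.mapped_memory=200M
neostore.propertystore.db.arrays.mapped_memory=10M
```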
Re: [Neo4j] Speeding up initial import of graph
You're right, the lucene-based import shouldn't fail with memory problems; I will look into that.

My suggestion is valid if you want to use an in-memory map to speed up the import. And if you're able to analyze / partition your data, that might be a viable solution.

Will get back to you with the findings later.

Michael

On 10.06.2011 at 09:02, Paul Bandler wrote:

> [Paul's reply quoted in full: removing the Map, using the index and setting cache_type to weak made no difference; the run log, config dump and OutOfMemoryError traces are reproduced in his message earlier in the thread.]
[Neo4j] Speeding up initial import of graph
Hi all,

I'm struggling with importing a graph with about 10M nodes and 20M relationships, with nodes having 0 to 10 relationships. Creating the nodes takes about 10 minutes, but creating the relationships is slower by several orders of magnitude. I'm using a 2.4 GHz i7 MacBook Pro with 4 GB RAM and a conventional HDD.

The graph is stored as an adjacency list in a text file, where each line has this form:

Foo|Bar|Baz
(node Foo has relations to Bar and Baz)

My current approach is to iterate over the whole file twice. In the first run, I create a node with the property "name" set to the first entry in the line (Foo in this case) and add it to an index. In the second run, I get the start node and the end nodes from the index by name and create the relationships.

My code can be found here: http://pastie.org/2041801

With my approach, the best I can achieve is 100 created relationships per second. I experimented with mapped memory settings, but without much effect. Is this the speed I can expect? Any advice on how to speed up this process?

Best regards,
Daniel Hepper
_______________________________________________
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user
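The two-pass scheme described above can be sketched without any Neo4j-specific calls; node creation is simulated here by handing out sequential ids (in the real importer they would come from the inserter and the index), and the class and method names are mine, not from the post:

```java
import java.util.*;

// Sketch of the two-pass adjacency-list import: pass 1 maps names to ids,
// pass 2 resolves each line's targets and emits (from, to) pairs.
public class TwoPassImport {

    public static Map<String, Long> firstPass(List<String> lines) {
        Map<String, Long> ids = new HashMap<>();
        long next = 0;
        for (String line : lines) {
            String name = line.split("\\|")[0];   // "Foo|Bar|Baz" -> "Foo"
            ids.put(name, next++);
        }
        return ids;
    }

    public static List<long[]> secondPass(List<String> lines, Map<String, Long> ids) {
        List<long[]> rels = new ArrayList<>();
        for (String line : lines) {
            String[] names = line.split("\\|");
            Long from = ids.get(names[0]);
            for (int j = 1; j < names.length; j++) {
                Long to = ids.get(names[j]);       // null if the target never
                if (from != null && to != null) {  // appears as a first column
                    rels.add(new long[]{from, to});
                }
            }
        }
        return rels;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("Foo|Bar|Baz", "Bar|Foo", "Baz|Foo");
        Map<String, Long> ids = firstPass(lines);
        List<long[]> rels = secondPass(lines, ids);
        // prints: 3 nodes, 4 relationships
        System.out.println(ids.size() + " nodes, " + rels.size() + " relationships");
    }
}
```

The expensive part in the real import is the `ids.get(...)` equivalent: done against a Lucene index it hits disk, done against a HashMap it is fast but costs heap.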
Re: [Neo4j] Speeding up initial import of graph
I too am experiencing similar problems, possibly worse than you're seeing, as I am using a very modestly provisioned Windows machine (1.5 GB RAM, max heap set to 1 GB, oldish processor). I found that when using the BatchInserter to load nodes, the heap grew and grew until it was exhausted and everything practically ground to a halt. I experimented with various settings of the cache memory but nothing made much difference. So now I reset the BatchInserter (i.e. shut it down and restart it) every 100,000 nodes or so. I posted questions on the list before, but the replies seemed to suggest it was just a config issue; no config changes I made helped much. I get the impression that most people are using Neo4j with hugely larger memory footprints than I can realistically expect to use at this stage, so maybe that is why this problem has not received much attention.

I have a similar approach to you for relationships, i.e. creating them in a second pass. I'm not sure how memory-hungry it is, but again I have built a class that resets the inserters every 100,000 relationships. It is slow, and experimenting with my 'reset' size didn't make much difference, so I suspect it is limited by index access time. Effectively I suspect it goes to disk for every index lookup it sees for the first time, and I also suspect the size of the index makes a difference, as I have over 3M nodes in some indexes and these are the ones that are very slow.

I suspect there is some tuning that can be done, and I really think the problem with running out of heap is probably a bug that should be fixed, but I am now turning my attention to finding ways of creating relationships when the initial nodes are created (at least where this is possible) to avoid the index lookup overhead... I'll let you know if/how this helps, but am also interested to learn of others' experience.

On 9 Jun 2011, at 10:59, Daniel Hepper wrote:

> [Daniel's original post quoted in full; see the start of the thread.]
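Paul's reset-every-N workaround can be sketched generically. The `Inserter` interface and `importAll` helper below are hypothetical stand-ins for illustration, not the real Neo4j BatchInserter API:

```java
import java.util.function.Supplier;

// Sketch of the workaround: shut the inserter down and recreate it every
// resetEvery operations, bounding whatever state it accumulates on the heap.
public class PeriodicReset {

    public interface Inserter {
        void insert(String record);
        void shutdown();
    }

    public static int importAll(Iterable<String> records,
                                Supplier<Inserter> factory, int resetEvery) {
        Inserter inserter = factory.get();
        int count = 0, restarts = 0;
        for (String r : records) {
            inserter.insert(r);
            if (++count % resetEvery == 0) {
                inserter.shutdown();      // flush everything to disk and
                inserter = factory.get(); // release the accumulated heap
                restarts++;
            }
        }
        inserter.shutdown();
        return restarts;
    }

    public static void main(String[] args) {
        int restarts = importAll(java.util.Arrays.asList("a", "b", "c", "d", "e", "f", "g"),
                () -> new Inserter() {
                    public void insert(String record) { /* write to store */ }
                    public void shutdown() { /* flush and close */ }
                }, 3);
        System.out.println("restarts: " + restarts); // prints: restarts: 2
    }
}
```

The cost of this pattern is the repeated shutdown/startup, which is why Paul reports it helps with heap exhaustion but not with overall speed.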
Re: [Neo4j] Speeding up initial import of graph
Hi Daniel,

I am currently working on a tool for importing big data sets into Neo4j graphs. The main problem in such operations is that the usual index implementations are just too slow for retrieving the mapping from keys to created node ids, so a custom solution is needed, one that depends to a varying degree on the distribution of values in the input set. While your dataset is smaller than the data sizes I deal with, I would like to use it as a test case. If you could somehow provide the actual data, or something that emulates it, I would be grateful.

If you want to see my approach, it is available here: https://github.com/digitalstain/BigDataImport

The core algorithm is an XJoin-style two-level hashing scheme with adaptable eviction strategies, but it is not production-ready yet, mainly from an API perspective. You can contact me directly for any details regarding this issue.

cheers,
CG

On Thu, Jun 9, 2011 at 12:59 PM, Daniel Hepper daniel.hep...@gmail.com wrote:

> [Daniel's original post quoted in full; see the start of the thread.]
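As a very loose illustration of the partitioning idea only (not Chris's actual XJoin implementation), a key-to-id map can be hashed into a fixed number of buckets so that any single bucket could be evicted to disk and reloaded on demand:

```java
import java.util.*;

// Loose sketch of a partitioned key -> node-id mapping: keys are hashed into
// fixed buckets, so a cold bucket could be serialized out and reloaded later.
// Eviction itself is only hinted at here; a real implementation (like
// BigDataImport's two-level hashing) handles spilling and reload strategies.
public class ShardedIdMap {

    private final List<Map<String, Long>> buckets;

    public ShardedIdMap(int n) {
        buckets = new ArrayList<>();
        for (int i = 0; i < n; i++) buckets.add(new HashMap<>());
    }

    private Map<String, Long> bucketFor(String key) {
        // floorMod keeps the index non-negative for negative hash codes
        return buckets.get(Math.floorMod(key.hashCode(), buckets.size()));
    }

    public void put(String key, long id) { bucketFor(key).put(key, id); }

    public Long get(String key) { return bucketFor(key).get(key); }

    // Hook where a real implementation would serialize bucket i to disk.
    public int bucketSize(int i) { return buckets.get(i).size(); }
}
```

The point of the scheme is that peak heap usage is bounded by the hot buckets rather than by the whole 10M-entry map.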
Re: [Neo4j] Speeding up initial import of graph
I recreated Daniel's code in Java, mainly because some things were missing from his Scala example.

You're right that the index is the bottleneck. But with your small data set it should be possible to cache the 10M nodes in a heap that fits in your machine. I ran it first with the index and got about 8 seconds / 1M nodes and 320 sec / 1M rels. Then I switched to a 3G heap and a HashMap for the name -> node lookup, and it went to 2 s / 1M nodes and 13 down to 3 sec for 1M rels.

That is the approach that Chris takes, only his solution can persist the map to disk and is more efficient :)

Hope that helps.

Michael

package org.neo4j.load;

import org.apache.commons.io.FileUtils;
import org.junit.Test;
import org.neo4j.graphdb.RelationshipType;
import org.neo4j.graphdb.index.BatchInserterIndex;
import org.neo4j.graphdb.index.BatchInserterIndexProvider;
import org.neo4j.helpers.collection.MapUtil;
import org.neo4j.index.impl.lucene.LuceneBatchInserterIndexProvider;
import org.neo4j.kernel.impl.batchinsert.BatchInserter;
import org.neo4j.kernel.impl.batchinsert.BatchInserterImpl;

import java.io.*;
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

/**
 * @author mh
 * @since 09.06.11
 */
public class Hepper {

    public static final int REPORT_COUNT = Config.MILLION; // Config.MILLION is an external constant (1,000,000)
    public static final int COUNT = Config.MILLION * 10;

    enum MyRelationshipTypes implements RelationshipType { BELONGS_TO }

    @Test
    public void createData() throws IOException {
        long time = System.currentTimeMillis();
        final PrintWriter writer = new PrintWriter(new BufferedWriter(new FileWriter("data.txt")));
        Random r = new Random(-1L);
        for (int nodes = 0; nodes < COUNT; nodes++) {
            writer.printf("%07d|%07d|%07d%n", nodes, r.nextInt(COUNT), r.nextInt(COUNT));
        }
        writer.close();
        System.out.println("Creating data took " + (System.currentTimeMillis() - time) / 1000 + " seconds");
    }

    @Test
    public void runImport() throws IOException {
        Map<String, Long> cache = new HashMap<String, Long>(COUNT);
        final File storeDir = new File("target/hepper");
        FileUtils.deleteDirectory(storeDir);
        BatchInserter inserter = new BatchInserterImpl(storeDir.getAbsolutePath());
        final BatchInserterIndexProvider indexProvider = new LuceneBatchInserterIndexProvider(inserter);
        final BatchInserterIndex index = indexProvider.nodeIndex("pages", MapUtil.stringMap("type", "exact"));
        BufferedReader reader = new BufferedReader(new FileReader("data.txt"));
        String line = null;
        int nodes = 0;
        long time = System.currentTimeMillis();
        long batchTime = time;
        while ((line = reader.readLine()) != null) {
            final String[] nodeNames = line.split("\\|");
            final String name = nodeNames[0];
            final Map<String, Object> props = MapUtil.map("name", name);
            final long node = inserter.createNode(props);
            // index.add(node, props);
            cache.put(name, node);
            nodes++;
            if ((nodes % REPORT_COUNT) == 0) {
                System.out.printf("%d nodes created. Took %d %n", nodes, (System.currentTimeMillis() - batchTime));
                batchTime = System.currentTimeMillis();
            }
        }
        System.out.println("Creating nodes took " + (System.currentTimeMillis() - time) / 1000);
        index.flush();
        reader.close();

        reader = new BufferedReader(new FileReader("data.txt"));
        int rels = 0;
        time = System.currentTimeMillis();
        batchTime = time;
        while ((line = reader.readLine()) != null) {
            final String[] nodeNames = line.split("\\|");
            final String name = nodeNames[0];
            // final Long from = index.get("name", name).getSingle();
            Long from = cache.get(name);
            for (int j = 1; j < nodeNames.length; j++) {
                // final Long to = index.get("name", nodeNames[j]).getSingle();
                final Long to = cache.get(nodeNames[j]); // note: the post as transmitted read cache.get(name) here
                inserter.createRelationship(from, to, MyRelationshipTypes.BELONGS_TO, null);
            }
            rels++;
            if ((rels % REPORT_COUNT) == 0) {
                System.out.printf("%d relationships created. Took %d %n", rels, (System.currentTimeMillis() - batchTime));
                batchTime = System.currentTimeMillis();
            }
        }
        System.out.println("Creating relationships took " + (System.currentTimeMillis() - time) / 1000);
    }
}

100 nodes created. Took 2227
200 nodes created. Took 1930
300 nodes created. Took 1818
400 nodes created. Took 1966
500 nodes created. Took 1857
600 nodes created. Took 2009
700 nodes created. Took 2068
800 nodes created. Took 1991
900 nodes created. Took 2151
1000 nodes created. Took 2276
Creating nodes took 20
100 relationships created. Took 13441
200 relationships created.
Re: [Neo4j] Speeding up initial import of graph
I will try caching the nodes in the heap as Michael suggested, and I'll also look into Chris's tool. Thanks everybody for the effort and the suggestions!

Daniel

On Thu, Jun 9, 2011 at 1:27 PM, Michael Hunger michael.hun...@neotechnology.com wrote:

> [Michael's message, including the full Hepper listing and its timings, quoted in full; see it earlier in the thread.]
Re: [Neo4j] Speeding up initial import of graph
I ran Michael’s example test import program, with the Map replacing the index, on my more modestly configured machine, to see whether the import scaling problems I have reported previously using BatchInserter were reproduced. They were. I gave the program 1G of heap and watched it run using jconsole. It ran reasonably quickly, consuming the heap in an almost straight line until it neared capacity; then it practically stopped for about 20 minutes, after which it died with an out-of-memory error (see below).

Now I’m not saying that Neo4j should necessarily go out of its way to support very memory-constrained environments, but I do think it is not unreasonable to expect its batch import mechanism not to fall over in this way; it should rather flush its buffers, or whatever, without requiring the import application writer to shut it down and restart it periodically...

Creating data took 331 seconds
100 nodes created. Took 29001
200 nodes created. Took 35107
300 nodes created. Took 35904
400 nodes created. Took 66169
500 nodes created. Took 63280
600 nodes created. Took 183922
700 nodes created. Took 258276

com.nomura.smo.rdm.neo4j.restore.Hepper
createData (330.364 seconds)
runImport (1,485 seconds later...)
java.lang.OutOfMemoryError: Java heap space
	at java.util.ArrayList.<init>(Unknown Source)
	at java.util.ArrayList.<init>(Unknown Source)
	at org.neo4j.kernel.impl.nioneo.store.PropertyRecord.<init>(PropertyRecord.java:33)
	at org.neo4j.kernel.impl.batchinsert.BatchInserterImpl.createPropertyChain(BatchInserterImpl.java:425)
	at org.neo4j.kernel.impl.batchinsert.BatchInserterImpl.createNode(BatchInserterImpl.java:143)
	at com.nomura.smo.rdm.neo4j.restore.Hepper.runImport(Hepper.java:61)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
	at java.lang.reflect.Method.invoke(Unknown Source)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
	at org.junit.runners.BlockJUnit4ClassRunner.runNotIgnored(BlockJUnit4ClassRunner.java:79)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:71)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:49)
	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52)
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191)
	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42)
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184)
	at org.junit.runners.ParentRunner.run(ParentRunner.java:236)
	at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:49)
	at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
	at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
	at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
	at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
	at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)

Regards,
Paul Bandler

On 9 Jun 2011, at 12:27, Michael Hunger wrote:

> [Michael's message, including the full Hepper listing, quoted in full; see it earlier in the thread.]
Re: [Neo4j] Speeding up initial import of graph
Please keep in mind that the HashMap of 10M strings -> longs will take a substantial amount of heap memory. That's not the fault of Neo4j :) On my system it alone takes 1.8 G of memory (distributed across the strings, the hashmap entries and the longs). So 3 GB of heap are sensible to run this; that leaves about 1 G for neo4j + its caches.

Of course you're free to shard your map (e.g. by first letter of the name) and persist those maps to disk and reload them if needed. But that's an application-level concern. If you are really limited that way wrt memory, you should try Chris Gioran's implementation, which will take care of that. Or you should perhaps use an amazon ec2 instance, which you can easily get with up to 68 G of RAM :)

Cheers
Michael

P.S. As a side-note, for the rest of the memory: have you tried the weak reference cache instead of the default soft one? In your config.properties add cache_type = weak; that should take care of your memory problems (and the stopping, which is actually the GC trying to reclaim memory).

On 09.06.2011 at 22:36, Paul Bandler wrote:

> [Paul's message quoted in full: his run of the example with 1G heap, the timing log and the OutOfMemoryError stack trace are reproduced in his message earlier in the thread.]
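Michael's 1.8 G figure is consistent with a back-of-envelope count of the per-entry objects. The per-object sizes below are typical 64-bit HotSpot estimates, assumptions rather than measurements:

```java
// Rough heap cost of a HashMap<String, Long> with short string keys.
// Sizes are approximate 64-bit JVM figures (object headers, fields,
// alignment); real numbers vary with JVM version and compressed oops.
public class HeapEstimate {

    public static long estimateBytes(long entries, int avgKeyLength) {
        long stringObject = 40;                   // String header + fields
        long charArray = 24 + 2L * avgKeyLength;  // char[] backing the key
        long mapEntry = 40;                       // HashMap.Entry + table slot
        long boxedLong = 24;                      // java.lang.Long value
        return entries * (stringObject + charArray + mapEntry + boxedLong);
    }

    public static void main(String[] args) {
        long bytes = estimateBytes(10_000_000L, 7); // 7-char names as in data.txt
        System.out.println(bytes + " bytes, roughly " + bytes / 1_000_000_000L + " GB");
    }
}
```

With these assumptions the estimate lands around 1.4 GB for 10M entries, the same order as the 1.8 G observed, which is why a 1 G heap cannot hold the map plus Neo4j's own buffers.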