Re: [Neo4j] Speeding up initial import of graph

2011-06-10 Thread Paul Bandler

On 9 Jun 2011, at 22:12, Michael Hunger wrote:

 Please keep in mind that the HashMap of 10M strings -> longs will take a 
 substantial amount of heap memory.
 That's not the fault of Neo4j :) On my system it alone takes 1.8 G of memory 
 (distributed across the strings, the hashmap entries and the longs).


Fair enough, but removing the Map and using the Index instead, and setting the 
cache_type to weak, makes almost no difference to the program's behaviour: it 
still progressively consumes the heap until it fails. I did this, including 
removing the allocation of the Map, and watched the heap consumption follow a 
similar pattern until it failed, as below.

 Or you should perhaps use an Amazon EC2 instance, which you can easily get 
 with up to 68 G of RAM :)

With respect, and while I notice the smile, throwing memory at it is not an 
option for a large set of enterprise applications that might actually be 
willing to pay to use Neo4j if it didn't fail at the first hurdle when 
confronted with a trivial, small-scale data load...

runImport failed after 2,072 seconds
 
Creating data took 316 seconds
Physical mem: 1535MB, Heap size: 1016MB
use_memory_mapped_buffers=false
neostore.propertystore.db.index.keys.mapped_memory=1M
neostore.propertystore.db.strings.mapped_memory=52M
neostore.propertystore.db.arrays.mapped_memory=60M
neo_store=N:\TradeModel\target\hepper\neostore
neostore.relationshipstore.db.mapped_memory=76M
neostore.propertystore.db.index.mapped_memory=1M
neostore.propertystore.db.mapped_memory=62M
dump_configuration=true
cache_type=weak
neostore.nodestore.db.mapped_memory=17M
100 nodes created. Took 59906
200 nodes created. Took 64546
300 nodes created. Took 74577
400 nodes created. Took 82607
500 nodes created. Took 171091
Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: Java heap space
        at java.io.BufferedOutputStream.<init>(Unknown Source)
        at java.io.BufferedOutputStream.<init>(Unknown Source)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(Unknown Source)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)
Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: Java heap space
        at java.io.BufferedInputStream.<init>(Unknown Source)
        at java.io.BufferedInputStream.<init>(Unknown Source)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(Unknown Source)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)
Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: Java heap space
        at java.io.BufferedOutputStream.<init>(Unknown Source)
        at java.io.BufferedOutputStream.<init>(Unknown Source)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(Unknown Source)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)
Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError: Java heap space
        at java.io.BufferedInputStream.<init>(Unknown Source)
        at java.io.BufferedInputStream.<init>(Unknown Source)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(Unknown Source)
        at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)

Re: [Neo4j] Speeding up initial import of graph

2011-06-10 Thread Michael Hunger
You're right, the Lucene-based import shouldn't fail with memory problems; I 
will look into that.

My suggestion is valid if you want to use an in-memory map to speed up the 
import. And if you're able to analyze / partition your data, that might be a 
viable solution.

Will get back to you with the findings later.

Michael

On 10 Jun 2011, at 09:02, Paul Bandler wrote:

 
 On 9 Jun 2011, at 22:12, Michael Hunger wrote:
 
 Please keep in mind that the HashMap of 10M strings -> longs will take a 
 substantial amount of heap memory.
 That's not the fault of Neo4j :) On my system it alone takes 1.8 G of memory 
 (distributed across the strings, the hashmap entries and the longs).
 
 
 Fair enough, but removing the Map and using the Index instead, and setting 
 the cache_type to weak, makes almost no difference to the program's behaviour: 
 it still progressively consumes the heap until it fails. I did this, including 
 removing the allocation of the Map, and watched the heap consumption follow a 
 similar pattern until it failed, as below.
 
 Or you should perhaps use an Amazon EC2 instance, which you can easily get 
 with up to 68 G of RAM :)
 
 With respect, and while I notice the smile, throwing memory at it is not an 
 option for a large set of enterprise applications that might actually be 
 willing to pay to use Neo4j if it didn't fail at the first hurdle when 
 confronted with a trivial, small-scale data load...
 

[Neo4j] Speeding up initial import of graph

2011-06-09 Thread Daniel Hepper
Hi all,

I'm struggling with importing a graph with about 10m nodes and 20m
relationships, with nodes having 0 to 10 relationships. Creating the
nodes takes about 10 minutes, but creating the relationships is slower
by several orders of magnitude. I'm using a 2.4 GHz i7 MacBook Pro with
4 GB RAM and a conventional HDD.

The graph is stored as an adjacency list in a text file where each line
has this form:

Foo|Bar|Baz
(Node Foo has relations to Bar and Baz)

My current approach is to iterate over the whole file twice. In the
first run, I create a node with the property "name" for the first
entry in the line ("Foo" in this case) and add it to an index.
In the second run, I get the start node and the end nodes from the
index by name and create the relationships.

My code can be found here: http://pastie.org/2041801

With my approach, the best I can achieve is 100 created relationships
per second.
I experimented with mapped memory settings, but without much effect.
Is this the speed I can expect?
Any advice on how to speed up this process?

Best regards,
Daniel Hepper


Re: [Neo4j] Speeding up initial import of graph

2011-06-09 Thread Paul Bandler
I too am experiencing similar problems - possibly worse than you're seeing, as I 
am using a very modestly provisioned Windows machine (1.5 GB RAM, max heap set 
to 1 GB, oldish processor).

I found that using the BatchInserter for loading nodes, the heap grew and grew 
until it was exhausted and everything practically ground to a halt. I 
experimented with various settings of the cache memory, but nothing made much 
difference, so now I reset the BatchInserter (i.e. shut it down and restart it) 
every 100,000 nodes or so. I posted questions on the list before, but the 
replies seemed to suggest that it was just a config issue - yet no config 
changes I made helped much. I get the impression that most people are using 
Neo4j with hugely larger memory footprints than I can realistically expect to 
use at this stage, and maybe that is why this problem does not receive much 
attention.
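
A minimal sketch of that shutdown/restart workaround, assuming the same 1.x 
BatchInserter API as Michael's listing later in the thread; the store path, 
input file and reset interval here are illustrative, not Paul's actual code:

import org.neo4j.helpers.collection.MapUtil;
import org.neo4j.kernel.impl.batchinsert.BatchInserter;
import org.neo4j.kernel.impl.batchinsert.BatchInserterImpl;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class RestartingImport {
    public static void main(String[] args) throws IOException {
        String storeDir = "target/import-db";                    // hypothetical store location
        BatchInserter inserter = new BatchInserterImpl(storeDir);
        BufferedReader reader = new BufferedReader(new FileReader("data.txt"));
        String line;
        int created = 0;
        while ((line = reader.readLine()) != null) {
            String name = line.split("\\|")[0];
            inserter.createNode(MapUtil.map("name", name));
            if (++created % 100000 == 0) {                       // "every 100,000 nodes or so"
                inserter.shutdown();                             // flush stores, release heap
                inserter = new BatchInserterImpl(storeDir);      // re-open the same store
            }
        }
        reader.close();
        inserter.shutdown();
    }
}

Each shutdown flushes the underlying stores, so the re-opened inserter simply 
continues appending to the same database directory.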

I have a similar approach to you for relationships - i.e. creating them in a 
second pass. I'm not sure how memory-hungry it is, but again I have built a 
class that resets the inserters every 100,000 relationships. It is slow, and 
experimenting with my 'reset' size didn't make much difference, so I suspect 
it's limited by index access time. Effectively, I suspect it goes to disk for 
every index lookup that it sees for the first time, and I also suspect that the 
size of the index makes a difference, as I have over 3m nodes in some indexes 
and these are the ones that are very slow.

I suspect there is some tuning that can be done, and I really think the problem 
with running out of heap is probably a bug that should be fixed, but I am now 
turning my attention to finding ways of creating relationships when the initial 
nodes are created (at least for those for which this is possible) to avoid the 
index lookup overhead...
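
One possible single-pass shape for that idea - a sketch under the same API 
assumptions as above, not Paul's actual code. Each name goes through a 
get-or-create cache, so relationships can be created as soon as a line is read 
and no Lucene lookup is needed; the heap cost of the map is the trade-off 
Michael raises later in the thread:

import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.helpers.collection.MapUtil;
import org.neo4j.kernel.impl.batchinsert.BatchInserter;
import org.neo4j.kernel.impl.batchinsert.BatchInserterImpl;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class SinglePassImport {
    public static void main(String[] args) throws IOException {
        BatchInserter inserter = new BatchInserterImpl("target/singlepass-db");
        Map<String, Long> ids = new HashMap<String, Long>();
        BufferedReader reader = new BufferedReader(new FileReader("data.txt"));
        String line;
        while ((line = reader.readLine()) != null) {
            String[] names = line.split("\\|");
            long from = idFor(names[0], ids, inserter);
            for (int j = 1; j < names.length; j++) {
                long to = idFor(names[j], ids, inserter);   // creates the node on first sight
                inserter.createRelationship(from, to,
                        DynamicRelationshipType.withName("BELONGS_TO"), null);
            }
        }
        reader.close();
        inserter.shutdown();
    }

    private static long idFor(String name, Map<String, Long> ids, BatchInserter inserter) {
        Long id = ids.get(name);
        if (id == null) {
            id = inserter.createNode(MapUtil.map("name", name));
            ids.put(name, id);
        }
        return id;
    }
}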

I'll let you know if/how this helps, but I am also interested to learn of 
others' experience.

On 9 Jun 2011, at 10:59, Daniel Hepper wrote:

 Hi all,
 
 I'm struggling with importing a graph with about 10m nodes and 20m
 relationships, with nodes having 0 to 10 relationships. Creating the
 nodes takes about 10 minutes, but creating the relationships is slower
 by several orders of magnitude. I'm using a 2.4 GHz i7 MacBook Pro with
 4 GB RAM and a conventional HDD.
 


Re: [Neo4j] Speeding up initial import of graph

2011-06-09 Thread Chris Gioran
Hi Daniel,

I am currently working on a tool for importing big data sets into Neo4j graphs.
The main problem in such operations is that the usual index implementations
are just too slow for retrieving the mapping from keys to created node ids, so
a custom solution is needed, one that depends to a varying degree on the
distribution of values in the input set.

While your dataset is smaller than the data sizes I deal with, I would like to
use it as a test case. If you could somehow provide the actual data, or
something that emulates it, I would be grateful.

If you want to see my approach, it is available here:

https://github.com/digitalstain/BigDataImport

The core algorithm is an XJoin-style two-level hashing scheme with adaptable
eviction strategies, but it is not production-ready yet, mainly from an API
perspective.
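
This is not Chris's algorithm - purely as a toy illustration of the problem 
space his tool addresses (capping the heap used by a key -> node-id map by 
evicting entries to disk), here is a sketch with entirely hypothetical names; 
his XJoin-style two-level hashing is far more sophisticated:

import java.io.*;
import java.util.LinkedHashMap;
import java.util.Map;

public class SpillingIdMap {
    private final File spillFile;
    private final LinkedHashMap<String, Long> hot;   // in-memory LRU tier

    public SpillingIdMap(final int maxInMemory, File spillFile) {
        this.spillFile = spillFile;
        this.hot = new LinkedHashMap<String, Long>(maxInMemory, 0.75f, true) {
            protected boolean removeEldestEntry(Map.Entry<String, Long> eldest) {
                if (size() > maxInMemory) {
                    spill(eldest.getKey(), eldest.getValue());  // push LRU entry to disk
                    return true;
                }
                return false;
            }
        };
    }

    public void put(String key, long id) { hot.put(key, id); }

    public Long get(String key) throws IOException {
        Long id = hot.get(key);
        return id != null ? id : scanSpill(key);     // slow path: scan the spill file
    }

    private void spill(String key, long id) {
        try (DataOutputStream out = new DataOutputStream(
                new FileOutputStream(spillFile, true))) {        // append
            out.writeUTF(key);
            out.writeLong(id);
        } catch (IOException e) { throw new RuntimeException(e); }
    }

    private Long scanSpill(String key) throws IOException {
        if (!spillFile.exists()) return null;
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(spillFile)))) {
            while (in.available() > 0) {
                String k = in.readUTF();
                long v = in.readLong();
                if (k.equals(key)) return v;
            }
        }
        return null;
    }
}

An import would call put() for every created node and get() for relationship 
endpoints; anything evicted costs a linear disk scan here, which is exactly the 
cost a smarter two-level scheme exists to avoid.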

You can contact me directly for any details regarding this issue.

cheers,
CG

On Thu, Jun 9, 2011 at 12:59 PM, Daniel Hepper daniel.hep...@gmail.com wrote:
 Hi all,

 I'm struggling with importing a graph with about 10m nodes and 20m
 relationships, with nodes having 0 to 10 relationships. Creating the
 nodes takes about 10 minutes, but creating the relationships is slower
 by several orders of magnitude. I'm using a 2.4 GHz i7 MacBook Pro with
 4 GB RAM and a conventional HDD.



Re: [Neo4j] Speeding up initial import of graph

2011-06-09 Thread Michael Hunger
I recreated Daniel's code in Java, mainly because some things were missing from 
his Scala example.

You're right that the index is the bottleneck. But with your small data set it 
should be possible to cache the 10m nodes in a heap that fits in your machine.

I ran it first with the index and had about 8 seconds / 1M nodes and 320 sec / 
1M rels.

Then I switched to a 3G heap and a HashMap to keep the name -> node lookup, and 
it went to 2 s / 1M nodes and 13 down to 3 sec for 1M rels.

That is the approach that Chris takes, only his solution can persist the map to 
disk and is more efficient :)

Hope that helps.

Michael

package org.neo4j.load;

import org.apache.commons.io.FileUtils;
import org.junit.Test;
import org.neo4j.graphdb.RelationshipType;
import org.neo4j.graphdb.index.BatchInserterIndex;
import org.neo4j.graphdb.index.BatchInserterIndexProvider;
import org.neo4j.helpers.collection.MapUtil;
import org.neo4j.index.impl.lucene.LuceneBatchInserterIndexProvider;
import org.neo4j.kernel.impl.batchinsert.BatchInserter;
import org.neo4j.kernel.impl.batchinsert.BatchInserterImpl;

import java.io.*;
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

/**
 * @author mh
 * @since 09.06.11
 */
public class Hepper {

    public static final int REPORT_COUNT = Config.MILLION;

    enum MyRelationshipTypes implements RelationshipType {
        BELONGS_TO
    }

    public static final int COUNT = Config.MILLION * 10;

    @Test
    public void createData() throws IOException {
        long time = System.currentTimeMillis();
        final PrintWriter writer = new PrintWriter(new BufferedWriter(new FileWriter("data.txt")));
        Random r = new Random(-1L);
        for (int nodes = 0; nodes < COUNT; nodes++) {
            writer.printf("%07d|%07d|%07d%n", nodes, r.nextInt(COUNT), r.nextInt(COUNT));
        }
        writer.close();
        System.out.println("Creating data took " + (System.currentTimeMillis() - time) / 1000 + " seconds");
    }

    @Test
    public void runImport() throws IOException {
        Map<String, Long> cache = new HashMap<String, Long>(COUNT);
        final File storeDir = new File("target/hepper");
        FileUtils.deleteDirectory(storeDir);
        BatchInserter inserter = new BatchInserterImpl(storeDir.getAbsolutePath());
        final BatchInserterIndexProvider indexProvider = new LuceneBatchInserterIndexProvider(inserter);
        final BatchInserterIndex index = indexProvider.nodeIndex("pages", MapUtil.stringMap("type", "exact"));
        BufferedReader reader = new BufferedReader(new FileReader("data.txt"));
        String line = null;
        int nodes = 0;
        long time = System.currentTimeMillis();
        long batchTime = time;
        // pass 1: create one node per line, remember name -> node id in the map
        while ((line = reader.readLine()) != null) {
            final String[] nodeNames = line.split("\\|");
            final String name = nodeNames[0];
            final Map<String, Object> props = MapUtil.map("name", name);
            final long node = inserter.createNode(props);
            // index.add(node, props);
            cache.put(name, node);
            nodes++;
            if ((nodes % REPORT_COUNT) == 0) {
                System.out.printf("%d nodes created. Took %d %n", nodes, (System.currentTimeMillis() - batchTime));
                batchTime = System.currentTimeMillis();
            }
        }

        System.out.println("Creating nodes took " + (System.currentTimeMillis() - time) / 1000);
        index.flush();
        reader.close();
        reader = new BufferedReader(new FileReader("data.txt"));
        int rels = 0;
        time = System.currentTimeMillis();
        batchTime = time;
        // pass 2: resolve both endpoints through the map and create the relationships
        while ((line = reader.readLine()) != null) {
            final String[] nodeNames = line.split("\\|");
            final String name = nodeNames[0];
            // final Long from = index.get("name", name).getSingle();
            final Long from = cache.get(name);
            for (int j = 1; j < nodeNames.length; j++) {
                // final Long to = index.get("name", nodeNames[j]).getSingle();
                final Long to = cache.get(nodeNames[j]);
                inserter.createRelationship(from, to, MyRelationshipTypes.BELONGS_TO, null);
            }
            rels++;
            if ((rels % REPORT_COUNT) == 0) {
                System.out.printf("%d relationships created. Took %d %n", rels, (System.currentTimeMillis() - batchTime));
                batchTime = System.currentTimeMillis();
            }
        }
        System.out.println("Creating relationships took " + (System.currentTimeMillis() - time) / 1000);
        indexProvider.shutdown();
        inserter.shutdown();
    }
}


100 nodes created. Took 2227 
200 nodes created. Took 1930 
300 nodes created. Took 1818 
400 nodes created. Took 1966 
500 nodes created. Took 1857 
600 nodes created. Took 2009 
700 nodes created. Took 2068 
800 nodes created. Took 1991 
900 nodes created. Took 2151 
1000 nodes created. Took 2276 
Creating nodes took 20
100 relationships created. Took 13441 
200 relationships created. 

Re: [Neo4j] Speeding up initial import of graph

2011-06-09 Thread Daniel Hepper
I will try caching the nodes in the heap as Michael suggested and I'll
also look into Chris' tool.

Thanks everybody for the effort and the suggestions!

Daniel


On Thu, Jun 9, 2011 at 1:27 PM, Michael Hunger
michael.hun...@neotechnology.com wrote:
 I recreated Daniel's code in Java, mainly because some things were missing 
 from his Scala example.

 You're right that the index is the bottleneck. But with your small data set 
 it should be possible to cache the 10m nodes in a heap that fits in your 
 machine.


Re: [Neo4j] Speeding up initial import of graph

2011-06-09 Thread Paul Bandler
I ran Michael’s example test import program, with the Map replacing the index, 
on my more modestly configured machine to see whether the import scaling 
problems I have reported previously using BatchInserter were reproduced. They 
were – I gave the program 1G of heap and watched it run using jconsole. It ran 
reasonably quickly, consuming the heap in an almost straight line, until it 
neared capacity; then it practically stopped for about 20 minutes, after which 
it died with an out of memory error – see below.
 
Now I’m not saying that Neo4j should necessarily go out of its way to support 
very memory-constrained environments, but I do think it is not unreasonable to 
expect its batch import mechanism not to fall over in this way; it should 
rather flush its buffers, or whatever, without requiring the import application 
writer to shut it down and restart it periodically...
 
Creating data took 331 seconds
100 nodes created. Took 29001
200 nodes created. Took 35107
300 nodes created. Took 35904
400 nodes created. Took 66169
500 nodes created. Took 63280
600 nodes created. Took 183922
700 nodes created. Took 258276
 
com.nomura.smo.rdm.neo4j.restore.Hepper
createData(330.364seconds)
runImport (1,485 seconds later...)
java.lang.OutOfMemoryError: Java heap space
        at java.util.ArrayList.<init>(Unknown Source)
        at java.util.ArrayList.<init>(Unknown Source)
        at org.neo4j.kernel.impl.nioneo.store.PropertyRecord.<init>(PropertyRecord.java:33)
        at org.neo4j.kernel.impl.batchinsert.BatchInserterImpl.createPropertyChain(BatchInserterImpl.java:425)
        at org.neo4j.kernel.impl.batchinsert.BatchInserterImpl.createNode(BatchInserterImpl.java:143)
        at com.nomura.smo.rdm.neo4j.restore.Hepper.runImport(Hepper.java:61)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
        at java.lang.reflect.Method.invoke(Unknown Source)
        at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
        at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
        at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
        at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
        at org.junit.runners.BlockJUnit4ClassRunner.runNotIgnored(BlockJUnit4ClassRunner.java:79)
        at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:71)
        at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:49)
        at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
        at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52)
        at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191)
        at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42)
        at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184)
        at org.junit.runners.ParentRunner.run(ParentRunner.java:236)
        at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:49)
        at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
        at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
        at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
        at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
        at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)
 
 
Regards,
Paul Bandler 
On 9 Jun 2011, at 12:27, Michael Hunger wrote:

 I recreated Daniels code in Java, mainly because some things were missing 
 from his scala example.
 
 You're right that the index is the bottleneck. But with your small data set 
 it should be possible to cache the 10m nodes in a heap that fits in your 
 machine.
 
 I ran it first with the index and had about 8 seconds / 1M nodes and 320 
 sec/1M rels.
 
 Then I switched to 3G heap and a HashMap to keep the name=node lookup and it 
 went to 2s/1M nodes and 13 down-to 3 sec for 1M rels.
 
 That is the approach that Chris takes only that his solution can persist the 
 map to disk and is more efficient :)
 
 Hope that helps.
 
 Michael
 
 package org.neo4j.load;
 
 import org.apache.commons.io.FileUtils;
 import org.junit.Test;
 import org.neo4j.graphdb.RelationshipType;
 import org.neo4j.graphdb.index.BatchInserterIndex;
 import org.neo4j.graphdb.index.BatchInserterIndexProvider;
 import org.neo4j.helpers.collection.MapUtil;
 import org.neo4j.index.impl.lucene.LuceneBatchInserterIndexProvider;
 import org.neo4j.kernel.impl.batchinsert.BatchInserter;
 import org.neo4j.kernel.impl.batchinsert.BatchInserterImpl;
 
 

Re: [Neo4j] Speeding up initial import of graph

2011-06-09 Thread Michael Hunger
Please keep in mind that the HashMap of 10M strings -> longs will take a 
substantial amount of heap memory.
That's not the fault of Neo4j :) On my system it alone takes 1.8 G of memory 
(distributed across the strings, the hashmap entries and the longs).
So 3 GB of heap are sensible to run this; that leaves about 1G for neo4j + its 
caches.
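
A rough back-of-envelope check of that figure (assuming a 64-bit JVM and the 
short 7-character keys from the test data; exact per-object sizes vary by JVM 
and settings):

public class MapFootprint {
    public static void main(String[] args) {
        long entries = 10000000L;   // 10M name -> node-id mappings
        long perEntry = 64          // String object + 7-char char[]
                      + 48          // HashMap.Entry (key/value/next refs + hash)
                      + 24          // boxed Long value
                      + 8;          // slot in the hash table array
        System.out.printf("~%.1f GB%n", entries * perEntry / 1e9);  // ~1.4 GB
    }
}

With table-resizing headroom and longer keys, this lands in the same ballpark 
as the 1.8 G measured above.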

Of course you're free to shard your map (e.g. by first letter of the name) and 
persist those maps to disk, reloading them when needed. But that's an 
application-level concern.
If you are really limited with respect to memory, you should try Chris 
Gioran's implementation, which will take care of that. Or you should perhaps 
use an Amazon EC2 instance, which you can easily get with up to 68 G of RAM :)
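
A minimal sketch of that sharding idea, assuming plain Java serialization; the 
file layout and the shard function (first letter of the name) are illustrative, 
and this lives entirely at the application level, not in Neo4j:

import java.io.*;
import java.util.HashMap;

public class ShardedNameCache {
    private final File dir;

    public ShardedNameCache(File dir) {
        this.dir = dir;
        dir.mkdirs();
    }

    private File shardFile(String name) {
        // shard by first letter; assumes non-empty names
        return new File(dir, "shard-" + name.charAt(0) + ".ser");
    }

    @SuppressWarnings("unchecked")
    public HashMap<String, Long> load(String name) throws IOException, ClassNotFoundException {
        File f = shardFile(name);
        if (!f.exists()) return new HashMap<String, Long>();
        ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(new FileInputStream(f)));
        try {
            return (HashMap<String, Long>) in.readObject();
        } finally {
            in.close();
        }
    }

    public void save(String name, HashMap<String, Long> shard) throws IOException {
        ObjectOutputStream out = new ObjectOutputStream(
                new BufferedOutputStream(new FileOutputStream(shardFile(name))));
        try {
            out.writeObject(shard);
        } finally {
            out.close();
        }
    }
}

Processing the input grouped by first letter keeps only one shard's HashMap on 
the heap at a time.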

Cheers

Michael


P.S. As a side note, for the rest of the memory:
Have you tried using the weak reference cache instead of the default soft one?
In your config.properties add
cache_type = weak
That should take care of your memory problems (and the stopping, which is 
actually the GC trying to reclaim memory).

On 9 Jun 2011, at 22:36, Paul Bandler wrote:

 I ran Michael’s example test import program, with the Map replacing the index, 
 on my more modestly configured machine to see whether the import scaling 
 problems I have reported previously using BatchInserter were reproduced. They 
 were – I gave the program 1G of heap and watched it run using jconsole. It ran 
 reasonably quickly, consuming the heap in an almost straight line, until it 
 neared capacity; then it practically stopped for about 20 minutes, after which 
 it died with an out of memory error – see below.
 
 Now I’m not saying that Neo4j should necessarily go out of its way to support 
 very memory-constrained environments, but I do think it is not unreasonable to 
 expect its batch import mechanism not to fall over in this way; it should 
 rather flush its buffers, or whatever, without requiring the import 
 application writer to shut it down and restart it periodically...
 