I recreated Daniel's code in Java, mainly because some things were missing from 
his Scala example.

You're right that the index is the bottleneck. But with a data set this small it 
should be possible to cache the 10M node ids in a heap that fits on your machine.
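(Back-of-envelope only, so take the numbers as rough: 10M HashMap entries with a 
short String key and a boxed Long value cost on the order of 100+ bytes each, i.e. 
roughly 1-1.5 GB, which is why a 3G heap is comfortable.)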

I ran it first with the index and got about 8 seconds per 1M nodes and 320 seconds 
per 1M relationships.

Then I switched to a 3G heap and a HashMap for the name => node lookup, and it 
went to 2 seconds per 1M nodes and from 13 down to 3 seconds per 1M relationships.
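(The 3G heap is just a JVM setting for the test process, e.g. something along the 
lines of

java -Xmx3g -cp <your classpath> org.junit.runner.JUnitCore org.neo4j.load.Hepper

where the classpath is whatever your build produces; if you run the test through 
Maven or your IDE instead, pass -Xmx3g to the forked test JVM there.)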

That is the approach Chris takes, except that his solution can persist the map to 
disk and is more efficient :)

Hope that helps.

Michael

package org.neo4j.load;

import org.apache.commons.io.FileUtils;
import org.junit.Test;
import org.neo4j.graphdb.RelationshipType;
import org.neo4j.graphdb.index.BatchInserterIndex;
import org.neo4j.graphdb.index.BatchInserterIndexProvider;
import org.neo4j.helpers.collection.MapUtil;
import org.neo4j.index.impl.lucene.LuceneBatchInserterIndexProvider;
import org.neo4j.kernel.impl.batchinsert.BatchInserter;
import org.neo4j.kernel.impl.batchinsert.BatchInserterImpl;

import java.io.*;
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

/**
 * @author mh
 * @since 09.06.11
 */
public class Hepper {

    // Config.MILLION is a constant (1,000,000) from the author's test utilities
    public static final int REPORT_COUNT = Config.MILLION;

    enum MyRelationshipTypes implements RelationshipType {
        BELONGS_TO
    }

    public static final int COUNT = Config.MILLION * 10;

    @Test
    public void createData() throws IOException {
        long time = System.currentTimeMillis();
        final PrintWriter writer = new PrintWriter(new BufferedWriter(new FileWriter("data.txt")));
        Random r = new Random(-1L);
        for (int nodes = 0; nodes < COUNT; nodes++) {
            writer.printf("%07d|%07d|%07d%n", nodes, r.nextInt(COUNT), r.nextInt(COUNT));
        }
        writer.close();
        System.out.println("Creating data took "+ (System.currentTimeMillis() - 
time) / 1000 +" seconds");
    }

    @Test
    public void runImport() throws IOException {
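        // Pass 1: create all nodes and cache name -> node id in a plain HashMap,
        // instead of looking the ids up later through the Lucene batch index.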
        Map<String,Long> cache=new HashMap<String, Long>(COUNT);
        final File storeDir = new File("target/hepper");
        FileUtils.deleteDirectory(storeDir);
        BatchInserter inserter = new BatchInserterImpl(storeDir.getAbsolutePath());
        final BatchInserterIndexProvider indexProvider = new LuceneBatchInserterIndexProvider(inserter);
        final BatchInserterIndex index = indexProvider.nodeIndex("pages", MapUtil.stringMap("type", "exact"));
        BufferedReader reader = new BufferedReader(new FileReader("data.txt"));
        String line = null;
        int nodes = 0;
        long time = System.currentTimeMillis();
        long batchTime=time;
        while ((line = reader.readLine()) != null) {
            final String[] nodeNames = line.split("\\|");
            final String name = nodeNames[0];
            final Map<String, Object> props = MapUtil.map("name", name);
            final long node = inserter.createNode(props);
            //index.add(node, props);
            cache.put(name,node);
            nodes++;
            if ((nodes % REPORT_COUNT) == 0) {
                System.out.printf("%d nodes created. Took %d %n", nodes, 
(System.currentTimeMillis() - batchTime));
                batchTime = System.currentTimeMillis();
            }
        }
        
        System.out.println("Creating nodes took "+ (System.currentTimeMillis() 
- time) / 1000);
        index.flush();
        reader.close();
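        // Pass 2: re-read the file and create the relationships, resolving node ids
        // from the in-memory cache (the commented-out lines show the original index lookups).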
        reader = new BufferedReader(new FileReader("data.txt"));
        int rels = 0;
        time = System.currentTimeMillis();
        batchTime=time;
        while ((line = reader.readLine()) != null) {
            final String[] nodeNames = line.split("\\|");
            final String name = nodeNames[0];
            //final Long from = index.get("name", name).getSingle();
            final Long from = cache.get(name);
            for (int j = 1; j < nodeNames.length; j++) {
                //final Long to = index.get("name", nodeNames[j]).getSingle();
                // look up the target by nodeNames[j], not name, otherwise every relationship points back to the start node
                final Long to = cache.get(nodeNames[j]);
                inserter.createRelationship(from, to, MyRelationshipTypes.BELONGS_TO, null);
            }
            rels++;
            if ((rels % REPORT_COUNT) == 0) {
                System.out.printf("%d relationships created. Took %d %n", rels, 
(System.currentTimeMillis() - batchTime));
                batchTime = System.currentTimeMillis();
            }
        }
        System.out.println("Creating relationships took "+ 
(System.currentTimeMillis() - time) / 1000);
    }
}


1000000 nodes created. Took 2227 
2000000 nodes created. Took 1930 
3000000 nodes created. Took 1818 
4000000 nodes created. Took 1966 
5000000 nodes created. Took 1857 
6000000 nodes created. Took 2009 
7000000 nodes created. Took 2068 
8000000 nodes created. Took 1991 
9000000 nodes created. Took 2151 
10000000 nodes created. Took 2276 
Creating nodes took 20
1000000 relationships created. Took 13441 
2000000 relationships created. Took 12887 
3000000 relationships created. Took 12922 
4000000 relationships created. Took 13149 
5000000 relationships created. Took 14177 
6000000 relationships created. Took 3377 
7000000 relationships created. Took 2932 
8000000 relationships created. Took 2991 
9000000 relationships created. Took 2992 
10000000 relationships created. Took 2912 
Creating relationships took 81

On 09.06.2011, at 12:51, Chris Gioran wrote:

> Hi Daniel,
> 
> I am currently working on a tool for importing big data sets into Neo4j
> graphs. The main problem in such operations is that the usual index
> implementations are just too slow for retrieving the mapping from keys to
> created node ids, so a custom solution is needed, one that depends to a
> varying degree on the distribution of values in the input set.
> 
> While your dataset is smaller than the data sizes I deal with, I would
> like to use it as a test case. If you could somehow provide the actual
> data or something that emulates it, I would be grateful.
> 
> If you want to see my approach, it is available here
> 
> https://github.com/digitalstain/BigDataImport
> 
> The core algorithm is an XJoin-style two-level hashing scheme with
> adaptable eviction strategies, but it is not production-ready yet,
> mainly from an API perspective.
> 
> You can contact me directly for any details regarding this issue.
> 
> cheers,
> CG
> 
> On Thu, Jun 9, 2011 at 12:59 PM, Daniel Hepper <[email protected]> 
> wrote:
>> Hi all,
>> 
>> I'm struggling with importing a graph with about 10m nodes and 20m
>> relationships, with nodes having 0 to 10 relationships. Creating the
>> nodes takes about 10 minutes, but creating the relationships is slower
>> by several orders of magnitude. I'm using a 2.4 GHz i7 MacBook Pro with
>> 4GB RAM and a conventional HDD.
>> 
>> The graph is stored as adjacency list in a text file where each line
>> has this form:
>> 
>> Foo|Bar|Baz
>> (Node Foo has relations to Bar and Baz)
>> 
>> My current approach is to iterate over the whole file twice. In the
>> first run, I create a node with the property "name" for the first
>> entry in the line (Foo in this case) and add it to an index.
>> In the second run, I get the start node and the end nodes from the
>> index by name and create the relationships.
>> 
>> My code can be found here: http://pastie.org/2041801
>> 
>> With my approach, the best I can achieve is 100 created relationships
>> per second.
>> I experimented with mapped memory settings, but without much effect.
>> Is this the speed I can expect?
>> Any advice on how to speed up this process?
>> 
>> Best regards,
>> Daniel Hepper