> can you share your test and the CompactIndex you wrote?
>
> That would be great.
See below...
> Also the memory settings (Xmx) you used for the different runs.
The heap size is displayed by neo4j in console entries such as:-
>> Physical mem: 1535MB, Heap size: 1016MB
So that one came from -Xmx1024M, and
>> Physical mem: 4096MB, Heap size: 2039MB
came from -Xmx2048M.
regards,
Paul
package com.xxx.neo4j.restore;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;

/**
 * A fixed-size (node, id) pair that serialises to exactly 12 bytes
 * (8-byte long node id + 4-byte int key), so pairs can be packed
 * end-to-end into one byte array and binary-chopped by slot.
 */
public class NodeIdPair implements Comparable<NodeIdPair> {

    private long _node;
    private int _id;

    static final NodeIdPair _prototype =
            new NodeIdPair(Long.MAX_VALUE, Integer.MAX_VALUE);

    static Integer MY_SIZE = null;

    /** Serialised size in bytes, computed once from a prototype instance. */
    public static int size() {
        if (MY_SIZE == null) {
            MY_SIZE = (new NodeIdPair(Long.MAX_VALUE, Integer.MAX_VALUE))
                    .toByteArray().length;
            System.out.println("MY_SIZE: " + MY_SIZE);
        }
        return MY_SIZE;
    }

    public NodeIdPair(long node, int id) {
        _node = node;
        _id = id;
    }

    /** Deserialise a pair from its 12-byte representation. */
    public NodeIdPair(byte fromByteArray[]) {
        ByteArrayInputStream bais = new ByteArrayInputStream(fromByteArray);
        DataInputStream dis = new DataInputStream(bais);
        try {
            _node = dis.readLong();
            _id = dis.readInt();
        } catch (Exception e) {
            throw new Error("Unexpected exception. byte[] len "
                    + fromByteArray.length, e);
        }
    }

    byte[] toByteArray() {
        ByteArrayOutputStream bos = new ByteArrayOutputStream(
                MY_SIZE != null ? MY_SIZE : 12);
        DataOutputStream dos = new DataOutputStream(bos);
        try {
            dos.writeLong(_node);
            dos.writeInt(_id);
            dos.flush();
        } catch (Exception e) {
            throw new Error("Unexpected exception: ", e);
        }
        return bos.toByteArray();
    }

    @Override
    public int compareTo(NodeIdPair arg0) {
        // assumes ids do not span the full int range, otherwise the
        // subtraction could overflow
        return _id - arg0._id;
    }

    public long getNode() {
        return _node;
    }

    public int getId() {
        return _id;
    }
}
package com.xxx.neo4j.restore;

import java.util.Arrays;
import java.util.TreeSet;

/**
 * A memory-compact id -> node lookup table. The sorted (id, node) pairs are
 * serialised end-to-end into a single byte array (the "extent") and looked
 * up by binary chop over fixed-size 12-byte slots, avoiding per-entry
 * object overhead.
 */
public class CompactNodeIndex {

    private int _offSet = 0;
    private byte _extent[];
    private int _slotCount;

    public CompactNodeIndex(TreeSet<NodeIdPair> sortedPairs) {
        _extent = new byte[sortedPairs.size() * NodeIdPair.size()];
        _slotCount = sortedPairs.size();
        // TreeSet iteration order is ascending by id (NodeIdPair.compareTo),
        // which is what the binary search below relies on.
        for (NodeIdPair pair : sortedPairs) {
            byte pairBytes[] = pair.toByteArray();
            copyToExtent(pairBytes);
        }
        System.out.println("CompactNodeIndex slot count: " + _slotCount);
    }

    /** Returns the pair stored for the given id, or null if absent. */
    public NodeIdPair findNodeForId(int id) {
        if (_slotCount == 0)
            return null; // empty index; avoids probing slot -1 below
        return search(id, 0, _slotCount - 1);
    }

    @SuppressWarnings("serial")
    static class FoundIt extends Exception {
        NodeIdPair _result;

        FoundIt(NodeIdPair result) {
            _result = result;
        }
    }

    /** Iterative binary chop; a hit is signalled via the FoundIt exception. */
    private NodeIdPair search(int soughtId, int lowerBound, int upperBound) {
        try {
            while (true) {
                if ((upperBound - lowerBound) > 1) {
                    int compareSlot = lowerBound
                            + ((upperBound - lowerBound) / 2);
                    int comparison = compareAt(soughtId, compareSlot);
                    if (comparison > 0) {
                        lowerBound = compareSlot;
                        continue;
                    } else {
                        upperBound = compareSlot;
                    }
                } else {
                    // down to at most two candidate slots
                    compareAt(soughtId, upperBound);
                    compareAt(soughtId, lowerBound);
                    // not found
                    return null;
                }
            }
        } catch (FoundIt result) {
            return result._result;
        }
    }

    private int compareAt(int soughtId, int compareSlot) throws FoundIt {
        NodeIdPair candidate = get(compareSlot);
        int diff = soughtId - candidate.getId();
        if (diff == 0)
            throw new FoundIt(candidate);
        return diff;
    }

    /** Deserialise the pair stored in the given slot. */
    private NodeIdPair get(int compareSlot) {
        int startPos = compareSlot * NodeIdPair.size();
        byte serialisedPair[] = Arrays.copyOfRange(_extent, startPos,
                startPos + NodeIdPair.size());
        return new NodeIdPair(serialisedPair);
    }

    private void copyToExtent(byte[] pairBytes) {
        for (byte b : pairBytes) {
            // >= rather than >, otherwise an overflowing write would fail
            // with an ArrayIndexOutOfBoundsException before this guard fired
            if (_offSet >= _extent.length)
                throw new Error("Unexpected extent overflow: " + _offSet);
            _extent[_offSet++] = b;
        }
    }
}
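
For reference, a minimal sketch of how the two classes above are driven; the
ids, node values and loop here are illustrative only, not taken from the real
import:

package com.xxx.neo4j.restore;

import java.util.TreeSet;

public class CompactNodeIndexDemo {
    public static void main(String[] args) {
        // build the sorted source structure (ordered by id via compareTo)
        TreeSet<NodeIdPair> pairs = new TreeSet<NodeIdPair>();
        for (int id = 0; id < 1000; id++) {
            long node = 1000000L + id; // stand-in for a BatchInserter node id
            pairs.add(new NodeIdPair(node, id));
        }
        // overlay onto the compact byte array; the TreeSet can now be gc'd
        CompactNodeIndex index = new CompactNodeIndex(pairs);
        pairs = null;

        NodeIdPair found = index.findNodeForId(42);
        System.out.println("id 42 -> node "
                + (found == null ? "not found" : found.getNode()));
    }
}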
On 13 Jun 2011, at 13:23, Michael Hunger wrote:
> Paul,
>
> can you share your test and the CompactIndex you wrote?
>
> That would be great.
>
> Also the memory settings (Xmx) you used for the different runs.
>
> Thanks so much
>
> Michael
>
> On 13.06.2011 at 14:15, Paul Bandler wrote:
>
>> Having noticed a mention in the 1.4M04 release notes that:
>>
>>> Also, the BatchInserterIndex now keeps its memory usage in-check with
>>> batched commits of indexed data using a configurable batch commit size.
>>
>> I re-ran this test using M04 and sure enough, node creation no longer eats
>> up the heap linearly, so that is good: I should be able to remove the
>> periodic resetting of the BatchInserter during import (sketched below).
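>>
>> For reference, a minimal sketch of that periodic-restart workaround; the
>> FLUSH_EVERY constant and the surrounding loop are illustrative, not the
>> actual import code:
>>
>> // inside the node-creation loop, every FLUSH_EVERY inserts:
>> if ((nodes % FLUSH_EVERY) == 0) {
>>     inserter.shutdown(); // flush stores and release the heap
>>     inserter = new BatchInserterImpl(storeDir.getAbsolutePath());
>> }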
>>
>> So I returned to the issue of removing the index creation and later access
>> bottleneck using an application-managed data structure as Michael
>> illustrated. Needing a solution with a smaller memory footprint, I wrote a
>> CompactNodeIndex class that maps integer 'id' key values to long node ids
>> with minimal overhead by overlaying a binary-choppable table onto a byte
>> array. Watching the heap in jconsole while this ran, I could see it had
>> the desired effect: huge amounts of heap are released once the
>> CompactNodeIndex is loaded and the source data structure is gc'd. However,
>> when I attempted to scale the test program back up to the 10M nodes
>> Michael had been testing, it ran into something of a brick wall, becoming
>> massively I/O bound when creating the relationships. With 1M nodes it ran
>> ok, with 2M nodes not too bad, but much beyond that it crawls along using
>> just about 1% of CPU while having loads of heap spare.
>>
>> I re-ran on a more generously configured iMac (giving the test 4G of heap)
>> and it did much better in that it actually showed some progress building
>> relationships over a 10M node-set, but still exhibited massive slow down
>> once past 7M relationships.
>>
>> Below are the test results. The question now is: are there any Neo4j
>> parameters that might relieve the I/O bottleneck that appears when
>> building relationships over node sets of this size with the
>> BatchInserter...? I note the section in the manual on performance
>> parameters, but not being familiar enough with the Neo4j internals, I'm
>> afraid I don't feel it gives enough clear information on how to set them
>> to improve the performance of this use-case.
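>>
>> For reference, a sketch of how such parameters can be passed to the
>> BatchInserter programmatically; these sizes are illustrative guesses on my
>> part, not tuned recommendations:
>>
>> Map<String, String> config = MapUtil.stringMap(
>>         "neostore.relationshipstore.db.mapped_memory", "300M",
>>         "neostore.nodestore.db.mapped_memory", "100M",
>>         "neostore.propertystore.db.mapped_memory", "100M");
>> BatchInserter inserter =
>>         new BatchInserterImpl(storeDir.getAbsolutePath(), config);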
>>
>> Thanks,
>>
>> Paul
>>
>> Run 1 - Windows m/c..REPORT_COUNT = MILLION/10
>> Physical mem: 1535MB, Heap size: 1016MB
>> use_memory_mapped_buffers=false
>> neostore.propertystore.db.index.keys.mapped_memory=1M
>> neostore.propertystore.db.strings.mapped_memory=52M
>> neostore.propertystore.db.arrays.mapped_memory=60M
>> neo_store=N:\TradeModel\target\hepper\neostore
>> neostore.relationshipstore.db.mapped_memory=76M
>> neostore.propertystore.db.index.mapped_memory=1M
>> neostore.propertystore.db.mapped_memory=62M
>> dump_configuration=true
>> cache_type=weak
>> neostore.nodestore.db.mapped_memory=17M
>> 100000 nodes created. Took 2906
>> 200000 nodes created. Took 2688
>> 300000 nodes created. Took 2828
>> 400000 nodes created. Took 2953
>> 500000 nodes created. Took 2672
>> 600000 nodes created. Took 2766
>> 700000 nodes created. Took 2687
>> 800000 nodes created. Took 2703
>> 900000 nodes created. Took 2719
>> 1000000 nodes created. Took 2641
>> Creating nodes took 27
>> MY_SIZE: 12
>> CompactNodeIndex slot count: 1000000
>> 100000 relationships created. Took 4125
>> 200000 relationships created. Took 3953
>> 300000 relationships created. Took 3937
>> 400000 relationships created. Took 3610
>> 500000 relationships created. Took 3719
>> 600000 relationships created. Took 4328
>> 700000 relationships created. Took 3750
>> 800000 relationships created. Took 3609
>> 900000 relationships created. Took 4125
>> 1000000 relationships created. Took 3781
>> 1100000 relationships created. Took 4125
>> 1200000 relationships created. Took 3750
>> 1300000 relationships created. Took 3907
>> 1400000 relationships created. Took 4297
>> 1500000 relationships created. Took 3703
>> 1600000 relationships created. Took 3687
>> 1700000 relationships created. Took 4328
>> 1800000 relationships created. Took 3907
>> 1900000 relationships created. Took 3718
>> 2000000 relationships created. Took 3891
>> Creating relationships took 78
>>
>> 2M Nodes on Windows m/c:-
>>
>> Creating data took 68 seconds
>> Physical mem: 1535MB, Heap size: 1016MB
>> use_memory_mapped_buffers=false
>> neostore.propertystore.db.index.keys.mapped_memory=1M
>> neostore.propertystore.db.strings.mapped_memory=52M
>> neostore.propertystore.db.arrays.mapped_memory=60M
>> neo_store=N:\TradeModel\target\hepper\neostore
>> neostore.relationshipstore.db.mapped_memory=76M
>> neostore.propertystore.db.index.mapped_memory=1M
>> neostore.propertystore.db.mapped_memory=62M
>> dump_configuration=true
>> cache_type=weak
>> neostore.nodestore.db.mapped_memory=17M
>> 100000 nodes created. Took 3188
>> 200000 nodes created. Took 3094
>> 300000 nodes created. Took 3062
>> 400000 nodes created. Took 2813
>> 500000 nodes created. Took 2718
>> 600000 nodes created. Took 3000
>> 700000 nodes created. Took 2938
>> 800000 nodes created. Took 2828
>> 900000 nodes created. Took 4172
>> 1000000 nodes created. Took 2859
>> 1100000 nodes created. Took 3625
>> 1200000 nodes created. Took 3235
>> 1300000 nodes created. Took 2781
>> 1400000 nodes created. Took 2891
>> 1500000 nodes created. Took 2922
>> 1600000 nodes created. Took 2968
>> 1700000 nodes created. Took 3438
>> 1800000 nodes created. Took 2687
>> 1900000 nodes created. Took 2969
>> 2000000 nodes created. Took 2891
>> Creating nodes took 61
>> MY_SIZE: 12
>> CompactNodeIndex slot count: 2000000
>> 100000 relationships created. Took 311377
>> 200000 relationships created. Took 11297
>> 300000 relationships created. Took 11062
>> 400000 relationships created. Took 10891
>> 500000 relationships created. Took 11109
>> 600000 relationships created. Took 11375
>> 700000 relationships created. Took 11266
>> 800000 relationships created. Took 26469
>> 900000 relationships created. Took 46875
>> 1000000 relationships created. Took 12047
>> 1100000 relationships created. Took 43016
>> 1200000 relationships created. Took 12110
>> 1300000 relationships created. Took 12625
>> 1400000 relationships created. Took 12031
>> 1500000 relationships created. Took 40375
>> 1600000 relationships created. Took 11328
>> 1700000 relationships created. Took 11125
>> 1800000 relationships created. Took 10891
>> 1900000 relationships created. Took 11266
>> 2000000 relationships created. Took 11125
>> 2100000 relationships created. Took 11281
>> 2200000 relationships created. Took 11156
>> 2300000 relationships created. Took 11250
>> 2400000 relationships created. Took 11735
>> 2500000 relationships created. Took 15984
>> 2600000 relationships created. Took 16766
>> 2700000 relationships created. Took 71969
>> 2800000 relationships created. Took 205283
>> 2900000 relationships created. Took 159236
>> 3000000 relationships created. Took 32734
>> 3100000 relationships created. Took 149064
>> 3200000 relationships created. Took 116391
>> 3300000 relationships created. Took 74079
>> 3400000 relationships created. Took 43360
>> 3500000 relationships created. Took 20500
>> 3600000 relationships created. Took 246704
>> 3700000 relationships created. Took 74407
>> 3800000 relationships created. Took 189611
>> 3900000 relationships created. Took 44922
>> 4000000 relationships created. Took 482675
>> Creating relationships took 2628
>>
>> iMac (REPORT_COUNT = MILLION)
>> Physical mem: 4096MB, Heap size: 2039MB
>> use_memory_mapped_buffers=false
>> neostore.propertystore.db.index.keys.mapped_memory=1M
>> neostore.propertystore.db.strings.mapped_memory=106M
>> neostore.propertystore.db.arrays.mapped_memory=120M
>> neo_store=/Users/paulbandler/Documents/workspace/Neo4jImport/target/hepper/neostore
>> neostore.relationshipstore.db.mapped_memory=152M
>> neostore.propertystore.db.index.mapped_memory=1M
>> neostore.propertystore.db.mapped_memory=124M
>> dump_configuration=true
>> cache_type=weak
>> neostore.nodestore.db.mapped_memory=34M
>> 1000000 nodes created. Took 2817
>> 2000000 nodes created. Took 2407
>> 3000000 nodes created. Took 2086
>> 4000000 nodes created. Took 2303
>> 5000000 nodes created. Took 2912
>> 6000000 nodes created. Took 2178
>> 7000000 nodes created. Took 2241
>> 8000000 nodes created. Took 2453
>> 9000000 nodes created. Took 2627
>> 10000000 nodes created. Took 3996
>> Creating nodes took 26
>> MY_SIZE: 12
>> CompactNodeIndex slot count: 10000000
>> 1000000 relationships created. Took 198784
>> 2000000 relationships created. Took 24203
>> 3000000 relationships created. Took 25313
>> 4000000 relationships created. Took 22177
>> 5000000 relationships created. Took 22406
>> 6000000 relationships created. Took 84977
>> 7000000 relationships created. Took 402123
>> 8000000 relationships created. Took 1342290
>>
>>
>> On 10 Jun 2011, at 08:27, Michael Hunger wrote:
>>
>>> You're right, the lucene-based import shouldn't fail with memory
>>> problems; I will look into that.
>>>
>>> My suggestion is valid if you want to use an in-memory map to speed up
>>> the import. And if you're able to analyze / partition your data, that
>>> might be a viable solution.
>>>
>>> Will get back to you with the findings later.
>>>
>>> Michael
>>>
>>> On 10.06.2011 at 09:02, Paul Bandler wrote:
>>>
>>>>
>>>> On 9 Jun 2011, at 22:12, Michael Hunger wrote:
>>>>
>>>>> Please keep in mind that the HashMap of 10M strings -> longs will take a
>>>>> substantial amount of heap memory.
>>>>> That's not the fault of Neo4j :) On my system it alone takes 1.8 G of
>>>>> memory (distributed across the strings, the hashmap-entries and the
>>>>> longs).
>>>>
>>>>
>>>> Fair enough, but removing the Map and using the Index instead, and
>>>> setting the cache_type to weak, makes almost no difference to the
>>>> program's behaviour in terms of progressively consuming the heap until
>>>> it fails. I did this, including removing the allocation of the Map, and
>>>> watched the heap consumption follow a similar pattern until it failed,
>>>> as below.
>>>>
>>>>> Or you should perhaps use an amazon ec2 instance which you can easily get
>>>>> with up to 68 G of RAM :)
>>>>
>>>> With respect, and while I notice the smile, throwing memory at it is not
>>>> an option for a large set of enterprise applications that might actually
>>>> be willing to pay to use Neo4j if it didn't fail at the first hurdle when
>>>> confronted with a trivial and small scale data load...
>>>>
>>>> runImport failed after 2,072 seconds....
>>>>
>>>> Creating data took 316 seconds
>>>> Physical mem: 1535MB, Heap size: 1016MB
>>>> use_memory_mapped_buffers=false
>>>> neostore.propertystore.db.index.keys.mapped_memory=1M
>>>> neostore.propertystore.db.strings.mapped_memory=52M
>>>> neostore.propertystore.db.arrays.mapped_memory=60M
>>>> neo_store=N:\TradeModel\target\hepper\neostore
>>>> neostore.relationshipstore.db.mapped_memory=76M
>>>> neostore.propertystore.db.index.mapped_memory=1M
>>>> neostore.propertystore.db.mapped_memory=62M
>>>> dump_configuration=true
>>>> cache_type=weak
>>>> neostore.nodestore.db.mapped_memory=17M
>>>> 1000000 nodes created. Took 59906
>>>> 2000000 nodes created. Took 64546
>>>> 3000000 nodes created. Took 74577
>>>> 4000000 nodes created. Took 82607
>>>> 5000000 nodes created. Took 171091
>>>> Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError:
>>>> Java heap space
>>>> at java.io.BufferedOutputStream.<init>(Unknown Source)
>>>> at java.io.BufferedOutputStream.<init>(Unknown Source)
>>>> at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(Unknown
>>>> Source)
>>>> at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(Unknown
>>>> Source)
>>>> at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown
>>>> Source)
>>>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>>>> at java.lang.Thread.run(Unknown Source)
>>>> Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError:
>>>> Java heap space
>>>> at java.io.BufferedInputStream.<init>(Unknown Source)
>>>> at java.io.BufferedInputStream.<init>(Unknown Source)
>>>> at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(Unknown
>>>> Source)
>>>> at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(Unknown
>>>> Source)
>>>> at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown
>>>> Source)
>>>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>>>> at java.lang.Thread.run(Unknown Source)
>>>> Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError:
>>>> Java heap space
>>>> at java.io.BufferedOutputStream.<init>(Unknown Source)
>>>> at java.io.BufferedOutputStream.<init>(Unknown Source)
>>>> at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(Unknown
>>>> Source)
>>>> at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(Unknown
>>>> Source)
>>>> at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown
>>>> Source)
>>>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>>>> at java.lang.Thread.run(Unknown Source)
>>>> Exception in thread "RMI TCP Connection(idle)" java.lang.OutOfMemoryError:
>>>> Java heap space
>>>> at java.io.BufferedInputStream.<init>(Unknown Source)
>>>> at java.io.BufferedInputStream.<init>(Unknown Source)
>>>> at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(Unknown
>>>> Source)
>>>> at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(Unknown
>>>> Source)
>>>> at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown
>>>> Source)
>>>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>>>> at java.lang.Thread.run(Unknown Source)
>>>>
>>>>
>>>>
>>>>
>>>>> So 3 GB of heap is sensible to run this; that leaves about 1G for neo4j
>>>>> + its caches.
>>>>>
>>>>> Of course you're free to shard your map (e.g. by first letter of the
>>>>> name) and persist those maps to disk, reloading them if needed. But
>>>>> that's an application-level concern; a sketch of the idea follows below.
>>>>> If you really are limited that way wrt memory, you should try Chris
>>>>> Gioran's implementation, which will take care of that. Or you could
>>>>> perhaps use an amazon ec2 instance, which you can easily get with up to
>>>>> 68 G of RAM :)
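>>>>>
>>>>> A minimal sketch of that sharding idea, assuming names shard acceptably
>>>>> on their first character and plain java.io serialisation for the
>>>>> persistence (all names here are hypothetical):
>>>>>
>>>>> import java.io.*;
>>>>> import java.util.HashMap;
>>>>> import java.util.Map;
>>>>>
>>>>> class ShardedNameCache {
>>>>>     // one name -> nodeId map per leading character; only the shard in
>>>>>     // use needs to stay on the heap, the rest can be flushed to disk
>>>>>     private final Map<Character, Map<String, Long>> shards =
>>>>>             new HashMap<Character, Map<String, Long>>();
>>>>>
>>>>>     private Map<String, Long> shardFor(String name) {
>>>>>         char key = name.charAt(0);
>>>>>         Map<String, Long> shard = shards.get(key);
>>>>>         if (shard == null) {
>>>>>             shard = new HashMap<String, Long>();
>>>>>             shards.put(key, shard);
>>>>>         }
>>>>>         return shard;
>>>>>     }
>>>>>
>>>>>     public void put(String name, long node) {
>>>>>         shardFor(name).put(name, node);
>>>>>     }
>>>>>
>>>>>     // write a shard out and drop it so its heap can be reclaimed
>>>>>     public void saveShard(char key) throws IOException {
>>>>>         ObjectOutputStream out = new ObjectOutputStream(
>>>>>                 new FileOutputStream("shard-" + key + ".ser"));
>>>>>         out.writeObject(shards.remove(key));
>>>>>         out.close();
>>>>>     }
>>>>> }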
>>>>>
>>>>> Cheers
>>>>>
>>>>> Michael
>>>>>
>>>>>
>>>>> P.S. As a side-note:
>>>>> For the rest of the memory:
>>>>> Have you tried to use weak reference cache instead of the default soft
>>>>> one?
>>>>> in your config.properties add
>>>>> cache_type = weak
>>>>> that should take care of your memory problems (and the stopping which is
>>>>> actually the GC trying to reclaim memory).
>>>>>
>>>>> On 09.06.2011 at 22:36, Paul Bandler wrote:
>>>>>
>>>>>> I ran Michael's example test import program, with the Map replacing
>>>>>> the index, on my more modestly configured machine to see whether the
>>>>>> import scaling problems I have reported previously using BatchInserter
>>>>>> were reproduced. They were: I gave the program 1G of heap and watched
>>>>>> it run using jconsole. It ran reasonably quickly, consuming the heap in
>>>>>> an almost straight line until it neared capacity, then practically
>>>>>> stopped for about 20 minutes, after which it died with an out of memory
>>>>>> error (see below).
>>>>>>
>>>>>> Now I'm not saying that Neo4j should necessarily go out of its way to
>>>>>> support very memory-constrained environments, but I do think it is not
>>>>>> unreasonable to expect its batch import mechanism not to fall over in
>>>>>> this way; it should rather flush its buffers or whatever, without
>>>>>> requiring the import application writer to shut it down and restart it
>>>>>> periodically...
>>>>>>
>>>>>> Creating data took 331 seconds
>>>>>> 1000000 nodes created. Took 29001
>>>>>> 2000000 nodes created. Took 35107
>>>>>> 3000000 nodes created. Took 35904
>>>>>> 4000000 nodes created. Took 66169
>>>>>> 5000000 nodes created. Took 63280
>>>>>> 6000000 nodes created. Took 183922
>>>>>> 7000000 nodes created. Took 258276
>>>>>>
>>>>>> com.nomura.smo.rdm.neo4j.restore.Hepper
>>>>>> createData(330.364seconds)
>>>>>> runImport (1,485 seconds later...)
>>>>>> java.lang.OutOfMemoryError: Java heap space
>>>>>> at java.util.ArrayList.<init>(Unknown Source)
>>>>>> at java.util.ArrayList.<init>(Unknown Source)
>>>>>> at
>>>>>> org.neo4j.kernel.impl.nioneo.store.PropertyRecord.<init>(PropertyRecord.java:33)
>>>>>> at
>>>>>> org.neo4j.kernel.impl.batchinsert.BatchInserterImpl.createPropertyChain(BatchInserterImpl.java:425)
>>>>>> at
>>>>>> org.neo4j.kernel.impl.batchinsert.BatchInserterImpl.createNode(BatchInserterImpl.java:143)
>>>>>> at com.nomura.smo.rdm.neo4j.restore.Hepper.runImport(Hepper.java:61)
>>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
>>>>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>>>>>> at java.lang.reflect.Method.invoke(Unknown Source)
>>>>>> at
>>>>>> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
>>>>>> at
>>>>>> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
>>>>>> at
>>>>>> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
>>>>>> at
>>>>>> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
>>>>>> at
>>>>>> org.junit.runners.BlockJUnit4ClassRunner.runNotIgnored(BlockJUnit4ClassRunner.java:79)
>>>>>> at
>>>>>> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:71)
>>>>>> at
>>>>>> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:49)
>>>>>> at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
>>>>>> at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52)
>>>>>> at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191)
>>>>>> at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42)
>>>>>> at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184)
>>>>>> at org.junit.runners.ParentRunner.run(ParentRunner.java:236)
>>>>>> at
>>>>>> org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:49)
>>>>>> at
>>>>>> org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
>>>>>> at
>>>>>> org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
>>>>>> at
>>>>>> org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
>>>>>> at
>>>>>> org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
>>>>>> at
>>>>>> org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)
>>>>>>
>>>>>>
>>>>>> Regards,
>>>>>> Paul Bandler
>>>>>> On 9 Jun 2011, at 12:27, Michael Hunger wrote:
>>>>>>
>>>>>>> I recreated Daniel's code in Java, mainly because some things were
>>>>>>> missing from his Scala example.
>>>>>>>
>>>>>>> You're right that the index is the bottleneck. But with your small data
>>>>>>> set it should be possible to cache the 10m nodes in a heap that fits in
>>>>>>> your machine.
>>>>>>>
>>>>>>> I ran it first with the index and had about 8 seconds / 1M nodes and
>>>>>>> 320 sec/1M rels.
>>>>>>>
>>>>>>> Then I switched to a 3G heap and a HashMap to keep the name=>node
>>>>>>> lookup, and it went to 2s/1M nodes and from 13 down to 3 sec for 1M
>>>>>>> rels.
>>>>>>>
>>>>>>> That is the approach that Chris takes, except that his solution can
>>>>>>> persist the map to disk and is more efficient :)
>>>>>>>
>>>>>>> Hope that helps.
>>>>>>>
>>>>>>> Michael
>>>>>>>
>>>>>>> package org.neo4j.load;
>>>>>>>
>>>>>>> import org.apache.commons.io.FileUtils;
>>>>>>> import org.junit.Test;
>>>>>>> import org.neo4j.graphdb.RelationshipType;
>>>>>>> import org.neo4j.graphdb.index.BatchInserterIndex;
>>>>>>> import org.neo4j.graphdb.index.BatchInserterIndexProvider;
>>>>>>> import org.neo4j.helpers.collection.MapUtil;
>>>>>>> import org.neo4j.index.impl.lucene.LuceneBatchInserterIndexProvider;
>>>>>>> import org.neo4j.kernel.impl.batchinsert.BatchInserter;
>>>>>>> import org.neo4j.kernel.impl.batchinsert.BatchInserterImpl;
>>>>>>>
>>>>>>> import java.io.*;
>>>>>>> import java.util.HashMap;
>>>>>>> import java.util.Map;
>>>>>>> import java.util.Random;
>>>>>>>
>>>>>>> /**
>>>>>>>  * @author mh
>>>>>>>  * @since 09.06.11
>>>>>>>  */
>>>>>>> public class Hepper {
>>>>>>>
>>>>>>>     public static final int REPORT_COUNT = Config.MILLION; // 1,000,000
>>>>>>>
>>>>>>>     enum MyRelationshipTypes implements RelationshipType {
>>>>>>>         BELONGS_TO
>>>>>>>     }
>>>>>>>
>>>>>>>     public static final int COUNT = Config.MILLION * 10;
>>>>>>>
>>>>>>>     // Writes COUNT lines of "name|related1|related2".
>>>>>>>     @Test
>>>>>>>     public void createData() throws IOException {
>>>>>>>         long time = System.currentTimeMillis();
>>>>>>>         final PrintWriter writer = new PrintWriter(new BufferedWriter(
>>>>>>>                 new FileWriter("data.txt")));
>>>>>>>         Random r = new Random(-1L);
>>>>>>>         for (int nodes = 0; nodes < COUNT; nodes++) {
>>>>>>>             writer.printf("%07d|%07d|%07d%n", nodes, r.nextInt(COUNT),
>>>>>>>                     r.nextInt(COUNT));
>>>>>>>         }
>>>>>>>         writer.close();
>>>>>>>         System.out.println("Creating data took "
>>>>>>>                 + (System.currentTimeMillis() - time) / 1000 + " seconds");
>>>>>>>     }
>>>>>>>
>>>>>>>     @Test
>>>>>>>     public void runImport() throws IOException {
>>>>>>>         Map<String, Long> cache = new HashMap<String, Long>(COUNT);
>>>>>>>         final File storeDir = new File("target/hepper");
>>>>>>>         FileUtils.deleteDirectory(storeDir);
>>>>>>>         BatchInserter inserter = new BatchInserterImpl(
>>>>>>>                 storeDir.getAbsolutePath());
>>>>>>>         final BatchInserterIndexProvider indexProvider =
>>>>>>>                 new LuceneBatchInserterIndexProvider(inserter);
>>>>>>>         final BatchInserterIndex index = indexProvider.nodeIndex("pages",
>>>>>>>                 MapUtil.stringMap("type", "exact"));
>>>>>>>         BufferedReader reader = new BufferedReader(new FileReader("data.txt"));
>>>>>>>         String line = null;
>>>>>>>         int nodes = 0;
>>>>>>>         long time = System.currentTimeMillis();
>>>>>>>         long batchTime = time;
>>>>>>>         // Pass 1: create one node per line, caching name -> node id.
>>>>>>>         while ((line = reader.readLine()) != null) {
>>>>>>>             final String[] nodeNames = line.split("\\|");
>>>>>>>             final String name = nodeNames[0];
>>>>>>>             final Map<String, Object> props = MapUtil.map("name", name);
>>>>>>>             final long node = inserter.createNode(props);
>>>>>>>             //index.add(node, props);
>>>>>>>             cache.put(name, node);
>>>>>>>             nodes++;
>>>>>>>             if ((nodes % REPORT_COUNT) == 0) {
>>>>>>>                 System.out.printf("%d nodes created. Took %d %n", nodes,
>>>>>>>                         (System.currentTimeMillis() - batchTime));
>>>>>>>                 batchTime = System.currentTimeMillis();
>>>>>>>             }
>>>>>>>         }
>>>>>>>         System.out.println("Creating nodes took "
>>>>>>>                 + (System.currentTimeMillis() - time) / 1000);
>>>>>>>         index.flush();
>>>>>>>         reader.close();
>>>>>>>
>>>>>>>         // Pass 2: re-read the file and create the relationships.
>>>>>>>         reader = new BufferedReader(new FileReader("data.txt"));
>>>>>>>         int rels = 0;
>>>>>>>         time = System.currentTimeMillis();
>>>>>>>         batchTime = time;
>>>>>>>         while ((line = reader.readLine()) != null) {
>>>>>>>             final String[] nodeNames = line.split("\\|");
>>>>>>>             final String name = nodeNames[0];
>>>>>>>             //final Long from = index.get("name", name).getSingle();
>>>>>>>             Long from = cache.get(name);
>>>>>>>             for (int j = 1; j < nodeNames.length; j++) {
>>>>>>>                 //final Long to = index.get("name", nodeNames[j]).getSingle();
>>>>>>>                 // look up the target node by its own name
>>>>>>>                 final Long to = cache.get(nodeNames[j]);
>>>>>>>                 inserter.createRelationship(from, to,
>>>>>>>                         MyRelationshipTypes.BELONGS_TO, null);
>>>>>>>             }
>>>>>>>             rels++;
>>>>>>>             if ((rels % REPORT_COUNT) == 0) {
>>>>>>>                 System.out.printf("%d relationships created. Took %d %n",
>>>>>>>                         rels, (System.currentTimeMillis() - batchTime));
>>>>>>>                 batchTime = System.currentTimeMillis();
>>>>>>>             }
>>>>>>>         }
>>>>>>>         System.out.println("Creating relationships took "
>>>>>>>                 + (System.currentTimeMillis() - time) / 1000);
>>>>>>>         // shut down to flush the store files to disk
>>>>>>>         indexProvider.shutdown();
>>>>>>>         inserter.shutdown();
>>>>>>>     }
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>> 1000000 nodes created. Took 2227
>>>>>>> 2000000 nodes created. Took 1930
>>>>>>> 3000000 nodes created. Took 1818
>>>>>>> 4000000 nodes created. Took 1966
>>>>>>> 5000000 nodes created. Took 1857
>>>>>>> 6000000 nodes created. Took 2009
>>>>>>> 7000000 nodes created. Took 2068
>>>>>>> 8000000 nodes created. Took 1991
>>>>>>> 9000000 nodes created. Took 2151
>>>>>>> 10000000 nodes created. Took 2276
>>>>>>> Creating nodes took 20
>>>>>>> 1000000 relationships created. Took 13441
>>>>>>> 2000000 relationships created. Took 12887
>>>>>>> 3000000 relationships created. Took 12922
>>>>>>> 4000000 relationships created. Took 13149
>>>>>>> 5000000 relationships created. Took 14177
>>>>>>> 6000000 relationships created. Took 3377
>>>>>>> 7000000 relationships created. Took 2932
>>>>>>> 8000000 relationships created. Took 2991
>>>>>>> 9000000 relationships created. Took 2992
>>>>>>> 10000000 relationships created. Took 2912
>>>>>>> Creating relationships took 81
>>>>>>>
>>>>>>> On 09.06.2011 at 12:51, Chris Gioran wrote:
>>>>>>>
>>>>>>>> Hi Daniel,
>>>>>>>>
>>>>>>>> I am currently working on a tool for importing big data sets into
>>>>>>>> Neo4j graphs.
>>>>>>>> The main problem in such operations is that the usual index
>>>>>>>> implementations are just too slow for retrieving the mapping from
>>>>>>>> keys to created node ids, so a custom solution is needed, one that
>>>>>>>> depends to a varying degree on the distribution of values in the
>>>>>>>> input set.
>>>>>>>>
>>>>>>>> While your dataset is smaller than the data sizes I deal with, I
>>>>>>>> would like to use it as a test case. If you could somehow provide the
>>>>>>>> actual data, or something that emulates it, I would be grateful.
>>>>>>>>
>>>>>>>> If you want to see my approach, it is available here
>>>>>>>>
>>>>>>>> https://github.com/digitalstain/BigDataImport
>>>>>>>>
>>>>>>>> The core algorithm is an XJoin-style two-level hashing scheme with
>>>>>>>> adaptable eviction strategies, but it is not production-ready yet,
>>>>>>>> mainly from an API perspective.
>>>>>>>>
>>>>>>>> You can contact me directly for any details regarding this issue.
>>>>>>>>
>>>>>>>> cheers,
>>>>>>>> CG
>>>>>>>>
>>>>>>>> On Thu, Jun 9, 2011 at 12:59 PM, Daniel Hepper
>>>>>>>> <[email protected]> wrote:
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> I'm struggling with importing a graph with about 10m nodes and 20m
>>>>>>>>> relationships, with nodes having 0 to 10 relationships. Creating the
>>>>>>>>> nodes takes about 10 minutes, but creating the relationships is slower
>>>>>>>>> by several orders of magnitude. I'm using a 2.4 GHz i7 MacBook Pro
>>>>>>>>> with 4GB RAM and a conventional HDD.
>>>>>>>>>
>>>>>>>>> The graph is stored as adjacency list in a text file where each line
>>>>>>>>> has this form:
>>>>>>>>>
>>>>>>>>> Foo|Bar|Baz
>>>>>>>>> (Node Foo has relations to Bar and Baz)
>>>>>>>>>
>>>>>>>>> My current approach is to iterate over the whole file twice. In the
>>>>>>>>> first run, I create a node with the property "name" for the first
>>>>>>>>> entry in the line (Foo in this case) and add it to an index.
>>>>>>>>> In the second run, I get the start node and the end nodes from the
>>>>>>>>> index by name and create the relationships.
>>>>>>>>>
>>>>>>>>> My code can be found here: http://pastie.org/2041801
>>>>>>>>>
>>>>>>>>> With my approach, the best I can achieve is 100 created relationships
>>>>>>>>> per second.
>>>>>>>>> I experimented with mapped memory settings, but without much effect.
>>>>>>>>> Is this the speed I can expect?
>>>>>>>>> Any advice on how to speed up this process?
>>>>>>>>>
>>>>>>>>> Best regards,
>>>>>>>>> Daniel Hepper
_______________________________________________
Neo4j mailing list
[email protected]
https://lists.neo4j.org/mailman/listinfo/user