Hi,
thanks for your quick reply.
Right now I'm using Lucene for 6M authors, but my whole dataset consists of
nearly 25M authors.
Can I use Lucene there as well? I think checking whether an author already
exists is getting really slow.
How can I change my heap memory and memory-map settings, given that I'm
using the transactional mode? I think that with 25M authors I will
otherwise get an OutOfMemoryError.
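From what I understand, the memory-mapping is configured per store file and
can be passed straight to EmbeddedGraphDatabase, while the heap is a plain
JVM flag. Here is a sketch of what I would try; the keys are the ones I
found for the 1.x series and the sizes are just guesses, so please correct
me if this is wrong:

    import java.util.HashMap;
    import java.util.Map;
    import org.neo4j.graphdb.GraphDatabaseService;
    import org.neo4j.kernel.EmbeddedGraphDatabase;

    public class ConfigSketch {
        public static void main(String[] args) {
            // Memory-mapped buffer sizes per store file (1.x-style keys; the
            // values below are guesses and should be tuned to the actual
            // sizes of the store files on disk).
            Map<String, String> config = new HashMap<String, String>();
            config.put("neostore.nodestore.db.mapped_memory", "500M");
            config.put("neostore.relationshipstore.db.mapped_memory", "2G");
            config.put("neostore.propertystore.db.mapped_memory", "500M");
            config.put("neostore.propertystore.db.strings.mapped_memory", "500M");

            GraphDatabaseService db = new EmbeddedGraphDatabase("target/db", config);
            db.shutdown();
            // The heap itself is set when starting the JVM, e.g.:
            //   java -Xmx4G WikiGraphRegUser
        }
    }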
Here is the code I have written so far:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Relationship;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.index.Index;
import org.neo4j.graphdb.index.IndexHits;
import org.neo4j.graphdb.index.IndexManager;
import org.neo4j.kernel.EmbeddedGraphDatabase;
public class WikiGraphRegUser {

    public static void main(String[] args) throws IOException {
        BufferedReader bf = new BufferedReader(new FileReader("E:/wiki0.csv"));
        try {
            WikiGraphRegUser wgru = new WikiGraphRegUser();
            wgru.createGraphDatabase(bf);
        } finally {
            bf.close(); // make sure the file handle is released
        }
    }
    private String articleName = "";
    private GraphDatabaseService db;
    private IndexManager index;
    private Index<Node> authorList; // Lucene index of author nodes
    private int transactionCounter = 0;
    private Node article;
    private boolean isFirstAuthor = false;
    private Node author;
    private Relationship relationship;
    private void createGraphDatabase(BufferedReader bf) {
        db = new EmbeddedGraphDatabase("target/db");
        index = db.index();
        authorList = index.forNodes("Author");
        String zeile; // current line of the CSV file
        Transaction tx = db.beginTx();
        try {
            // read the CSV file line by line
            while ((zeile = bf.readLine()) != null) {
                // commit every 50,000 lines to keep the running transaction small
                if (transactionCounter++ % 50000 == 0) {
                    tx.success();
                    tx.finish();
                    tx = db.beginTx();
                }
                // each line looks like this: Article%;% Timestamp%;% Author
                String[] artikelinfo = zeile.split("%;% ");
                if (artikelinfo.length != 3) {
                    System.out.println("ERROR: check CSV");
                    for (int i = 0; i < artikelinfo.length; i++) {
                        System.out.println(artikelinfo[i]);
                    }
                    return;
                }
                // new article: create it and connect it to the reference node
                if (!articleName.equals(artikelinfo[0])) {
                    article = createArticle(artikelinfo[0],
                            db.getReferenceNode(),
                            MyRelationshipTypes.ARTICLE);
                    articleName = artikelinfo[0];
                    isFirstAuthor = true;
                }
                // the first author of an article WROTE it, everyone else EDITs it
                MyRelationshipTypes relType = isFirstAuthor
                        ? MyRelationshipTypes.WROTE
                        : MyRelationshipTypes.EDIT;
                isFirstAuthor = false;
                // check whether the author already exists in the index
                IndexHits<Node> hits = authorList.get("Author", artikelinfo[2]);
                Node existing = hits.getSingle();
                if (existing == null) {
                    // new author: create him and connect him to the article
                    author = createAndConnectNode(artikelinfo[2], article,
                            relType, artikelinfo[1]);
                } else {
                    // author already exists: just add the relationship
                    relationship = existing.createRelationshipTo(article, relType);
                    relationship.setProperty("Timestamp", artikelinfo[1]);
                }
            }
            tx.success();
        } catch (Exception e) {
            e.printStackTrace(); // otherwise failures disappear silently
            tx.failure();
        } finally {
            tx.finish();
        }
        db.shutdown();
    }
    /**
     * Creates an article node and connects it to the reference node.
     *
     * @param name the article name
     * @param reference the reference node
     * @param relationship the relationship type
     * @return the new article node
     */
    private Node createArticle(String name, Node reference,
            MyRelationshipTypes relationship) {
        Node node = db.createNode();
        node.setProperty("Article", name);
        reference.createRelationshipTo(node, relationship);
        return node;
    }
    /**
     * Creates an author node, adds it to the Lucene index, and connects it
     * to an article.
     *
     * @param name the author name
     * @param otherNode the article node
     * @param relationshipType the relationship type (WROTE or EDIT)
     * @param timestamp the timestamp of the revision
     * @return the new author node
     */
    private Node createAndConnectNode(String name, Node otherNode,
            MyRelationshipTypes relationshipType, String timestamp) {
        Node node = db.createNode();
        node.setProperty("Name", name);
        authorList.add(node, "Author", name);
        relationship = node.createRelationshipTo(otherNode, relationshipType);
        relationship.setProperty("Timestamp", timestamp);
        return node;
    }
}
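For completeness, MyRelationshipTypes is just an enum implementing
RelationshipType; a minimal version matching the constants used above looks
roughly like this:

    import org.neo4j.graphdb.RelationshipType;

    public enum MyRelationshipTypes implements RelationshipType {
        ARTICLE, // ReferenceNode --> Article
        WROTE,   // first author of an article
        EDIT     // later edits by other authors
    }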
Maybe you know some optimizations here, because my database is already
20 GB with 6M authors and 20M articles.
The graph looks like this so far:
ReferenceNode ---> Article <---- (Wrote/Edit) Author
As for the distributed setup, I first have to check what is possible at the
university. We only considered it because we want to run queries like
PageRank or node degree against the database, and on a distributed system
those queries should be faster.
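In the meantime, node degree at least can already be computed with the
embedded API; here is a rough sketch (the node id argument is just a
placeholder):

    import org.neo4j.graphdb.GraphDatabaseService;
    import org.neo4j.graphdb.Node;
    import org.neo4j.graphdb.Relationship;
    import org.neo4j.kernel.EmbeddedGraphDatabase;

    public class NodeDegree {
        public static void main(String[] args) {
            GraphDatabaseService db = new EmbeddedGraphDatabase("target/db");
            try {
                // args[0] is the id of an author or article node
                Node node = db.getNodeById(Long.parseLong(args[0]));
                int degree = 0;
                // count all ARTICLE/WROTE/EDIT relationships touching this node
                for (Relationship r : node.getRelationships()) {
                    degree++;
                }
                System.out.println("degree of node " + node.getId() + ": " + degree);
            } finally {
                db.shutdown();
            }
        }
    }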
Thank you for your help again,
Stephan