Hi,

thanks for your fast answer.
Right now I'm using Lucene for 6M authors, but my whole dataset consists of
nearly 25M authors.
Can I use Lucene there as well? I'm worried that checking whether a user
already exists will get really slow at that scale.
How can I change my heap settings and my memory-map settings, given that I'm
using the transactional mode?
I suspect that with 25M authors I will otherwise run into an OutOfMemoryError.
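
From what I've read, the heap itself is a JVM setting (-Xmx), while the sizes
of the memory-mapped store buffers can be passed to the embedded database as a
configuration map. Is something like the following the right way to do it?
The values are just my guesses:

import java.util.HashMap;
import java.util.Map;

// heap is set on the JVM command line, e.g.: java -Xmx4G WikiGraphRegUser
Map<String, String> config = new HashMap<String, String>();
// memory-mapped buffer sizes for the store files (values are guesses)
config.put("neostore.nodestore.db.mapped_memory", "500M");
config.put("neostore.relationshipstore.db.mapped_memory", "3G");
config.put("neostore.propertystore.db.mapped_memory", "1G");
config.put("neostore.propertystore.db.strings.mapped_memory", "1G");
db = new EmbeddedGraphDatabase("target/db", config);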

Here is the code I have written so far:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Relationship;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.index.Index;
import org.neo4j.graphdb.index.IndexHits;
import org.neo4j.graphdb.index.IndexManager;
import org.neo4j.kernel.EmbeddedGraphDatabase;

public class WikiGraphRegUser {

        /**
         * @param args
         */
        public static void main(String[] args) throws IOException {

                BufferedReader bf = new BufferedReader(new FileReader(
                                "E:/wiki0.csv"));
                WikiGraphRegUser wgru = new WikiGraphRegUser();
                wgru.createGraphDatabase(bf);
        }

        private String articleName = "";
        private GraphDatabaseService db;
        private IndexManager index;
        private Index<Node> authorList;
        private int transactionCounter = 0;
        private Node article;
        private boolean isFirstAuthor = false;
        private Node author;
        private Relationship relationship;

        private void createGraphDatabase(BufferedReader bf) {
                db = new EmbeddedGraphDatabase("target/db");
                index = db.index();
                authorList = index.forNodes("Author");

                String zeile;
                Transaction tx = db.beginTx();

                try {
                        // reads the CSV file line by line
                        while ((zeile = bf.readLine()) != null) {
                                // commit every 50000 lines so the transaction
                                // state does not grow without bound
                                if (++transactionCounter % 50000 == 0) {
                                        tx.success();
                                        tx.finish();
                                        tx = db.beginTx();
                                }
                                // each line looks like this:
                                // Article%;% Timestamp%;% Author
                                String[] artikelinfo = zeile.split("%;% ");
                                if (artikelinfo.length != 3) {
                                        System.out.println("ERROR: check CSV");
                                        for (int i = 0; i < artikelinfo.length; i++) {
                                                System.out.println(artikelinfo[i]);
                                        }
                                        return;
                                }

                                // a new article begins: create it and connect
                                // it with the reference node
                                if (!articleName.equals(artikelinfo[0])) {
                                        article = createArticle(artikelinfo[0],
                                                        db.getReferenceNode(),
                                                        MyRelationshipTypes.ARTICLE);
                                        articleName = artikelinfo[0];
                                        isFirstAuthor = true;
                                }

                                // the first author of an article WROTE it,
                                // every later author EDITed it
                                MyRelationshipTypes type = isFirstAuthor
                                                ? MyRelationshipTypes.WROTE
                                                : MyRelationshipTypes.EDIT;

                                // checks if the author already exists
                                IndexHits<Node> hits = authorList.get("Author",
                                                artikelinfo[2]);
                                Node existing = hits.getSingle();
                                if (existing == null) {
                                        // new author: create him and connect
                                        // him with the article
                                        author = createAndConnectNode(
                                                        artikelinfo[2], article,
                                                        type, artikelinfo[1]);
                                } else {
                                        // author already exists: only add the
                                        // relationship to the article
                                        relationship = existing.createRelationshipTo(
                                                        article, type);
                                        relationship.setProperty("Timestamp",
                                                        artikelinfo[1]);
                                }
                                isFirstAuthor = false;

                                tx.success();
                        }
                } catch (Exception e) {
                        e.printStackTrace();
                        tx.failure();
                } finally {
                        tx.finish();
                }
                db.shutdown();

        }

        /**
         * creates an article and connects it with the reference node
         * 
         * @param name
         *            article name
         * @param reference
         *            reference node
         * @param relationship
         *            relationship type
         * @return article node
         */
        private Node createArticle(String name, Node reference,
                        MyRelationshipTypes relationship) {
                Node node = db.createNode();
                node.setProperty("Article", name);

                reference.createRelationshipTo(node, relationship);
                return node;
        }

        /**
         * creates an author node and connects him with an article
         * 
         * @param name
         *            author name
         * @param otherNode
         *            article node
         * @param relationshipType
         *            relationship type
         * @param timestamp
         *            timestamp of the edit
         * @return new author node
         */
        private Node createAndConnectNode(String name, Node otherNode,
                        MyRelationshipTypes relationshipType, String timestamp) {

                Node node = db.createNode();
                node.setProperty("Name", name);
                authorList.add(node, "Author", name);
                relationship = node.createRelationshipTo(otherNode,
                                relationshipType);
                relationship.setProperty("Timestamp", timestamp);

                return node;
        }

}

Maybe you know of some optimizations here, because my database is already
20GB in size with 6M authors and 20M articles.
The graph looks like this so far:

ReferenceNode ---> Article <---- (WROTE/EDIT) Author
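
One optimization I have been thinking about myself: caching the author nodes
I have already looked up in a plain in-memory map, so the Lucene index is only
hit on a cache miss. A rough, untested sketch of what I mean (authorCache and
getOrCreateAuthor are just my own names):

private final Map<String, Node> authorCache = new HashMap<String, Node>();

// checks the in-memory cache first and only falls back to the
// Lucene index (and to node creation) on a miss
private Node getOrCreateAuthor(String name) {
        Node cached = authorCache.get(name);
        if (cached != null) {
                return cached;
        }
        Node node = authorList.get("Author", name).getSingle();
        if (node == null) {
                node = db.createNode();
                node.setProperty("Name", name);
                authorList.add(node, "Author", name);
        }
        authorCache.put(name, node);
        return node;
}

Though I guess with 25M distinct authors the map itself would need a lot of
heap, so maybe a bounded (LRU) cache would be more realistic?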


About the distributed system: I first have to check on that at the
university. We were just thinking about it because we want to run queries
like PageRank or node degree against the database, and on a distributed
system those queries should be faster.


Thank you for your help again,
Stephan 
