I'm not sure it's a good idea to call tx.success() on every iteration of
the loop. I suggest calling it only when you commit a batch, and once more
after the loop (i.e. move it two lines down).

Also, I think a commit size of 50k is a little large. You're probably not
going to see much improvement past 10k. In fact, I generally only use 1k
myself (but I hear 10k is popular too) :-)
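
Roughly what I mean, as a sketch only. Tx and Db below are stand-ins I made up
so the snippet compiles without Neo4j on the classpath; they mirror the
success()/finish()/beginTx() calls your code already makes on Transaction and
GraphDatabaseService. importLines and LineHandler are hypothetical names too,
and the batch size is whatever you pass in (1k to 10k as discussed).

```java
class BatchCommitSketch {

    /** Stand-in for the two Transaction calls used in the quoted code. */
    public interface Tx {
        void success();
        void finish();
    }

    /** Stand-in for GraphDatabaseService.beginTx(). */
    public interface Db {
        Tx beginTx();
    }

    /** Hypothetical callback that creates nodes/relationships for one CSV line. */
    public interface LineHandler {
        void handle(String line);
    }

    /**
     * Processes all lines, committing every batchSize lines.
     * success() is called once per batch and once after the loop,
     * never once per line. Returns the number of transactions committed.
     */
    static int importLines(Db db, Iterable<String> lines, int batchSize,
                           LineHandler handler) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int commits = 0;
        int pending = 0;
        Tx tx = db.beginTx();
        try {
            for (String line : lines) {
                handler.handle(line);
                if (++pending == batchSize) {
                    // Commit the full batch and start a fresh transaction.
                    tx.success();
                    tx.finish();
                    commits++;
                    pending = 0;
                    tx = db.beginTx();
                }
            }
            // Mark the last (possibly partial) batch as successful.
            tx.success();
            commits++;
        } finally {
            // If an exception skipped success(), finish() rolls back instead.
            tx.finish();
        }
        return commits;
    }
}
```

Unlike the catch block in your version, this lets an exception propagate after
the open transaction is finished without success(), so a bad line is not
silently swallowed.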

On Sun, Jul 17, 2011 at 8:53 PM, st3ven <st3...@web.de> wrote:

> Hi,
>
> thanks for your fast answer.
> Right now I'm using Lucene for 6M authors, but my whole dataset consists of
> nearly 25M authors.
> Can I use Lucene there as well? I think checking whether a user already
> exists is getting really slow.
> How can I change my heap memory settings and my memory-map settings, given
> that I'm using the transactional mode?
> I think with 25M authors I will get an OutOfMemory exception.
>
> Here is my code that I have already written so far:
>
> import java.io.BufferedReader;
> import java.io.FileReader;
> import java.io.IOException;
>
> import org.neo4j.graphdb.GraphDatabaseService;
> import org.neo4j.graphdb.Node;
> import org.neo4j.graphdb.Relationship;
> import org.neo4j.graphdb.Transaction;
> import org.neo4j.graphdb.index.Index;
> import org.neo4j.graphdb.index.IndexHits;
> import org.neo4j.graphdb.index.IndexManager;
> import org.neo4j.kernel.EmbeddedGraphDatabase;
>
> public class WikiGraphRegUser {
>
>        /**
>         * @param args
>         */
>        public static void main(String[] args) throws IOException {
>
>                BufferedReader bf = new BufferedReader(new FileReader(
>                                "E:/wiki0.csv"));
>                WikiGraphRegUser wgru = new WikiGraphRegUser();
>                wgru.createGraphDatabase(bf);
>        }
>
>        private String articleName = "";
>        private GraphDatabaseService db;
>        private IndexManager index;
>        private Index<Node> authorList;
>        private int transactionCounter = 0;
>        private Node article;
>        private boolean isFirstAuthor = false;
>        private Node author;
>        private Relationship relationship;
>        private int node;
>
>        private void createGraphDatabase(BufferedReader bf) {
>                db = new EmbeddedGraphDatabase("target/db");
>                index = db.index();
>                authorList = index.forNodes("Author");
>
>                String zeile;
>                Transaction tx = db.beginTx();
>
>                try {
>                        // reads lines of CSV-file
>                        while ((zeile = bf.readLine()) != null) {
>                                if (transactionCounter++ % 50000 == 0) {
>
>                                        tx.success();
>                                        tx.finish();
>                                        tx = db.beginTx();
>                                }
>                                // String[] looks like this: Article%;% Timestamp%;% Author
>                                String[] artikelinfo = zeile.split("%;% ");
>                                if (artikelinfo.length != 3) {
>                                        System.out.println("ERROR: check CSV");
>                                        for (int i = 0; i < artikelinfo.length; i++) {
>                                                System.out.println(artikelinfo[i]);
>                                        }
>                                        return;
>                                }
>
>                                if (articleName == "") {
>                                        // create Article and connect with ReferenceNode
>                                        article = createArticle(artikelinfo[0],
>                                                        db.getReferenceNode(), MyRelationshipTypes.ARTICLE);
>                                        articleName = artikelinfo[0];
>
>                                        isFirstAuthor = true;
>
>                                } else if (!articleName.equals(artikelinfo[0])) {
>                                        // create Article and connect with ReferenceNode
>                                        article = createArticle(artikelinfo[0],
>                                                        db.getReferenceNode(), MyRelationshipTypes.ARTICLE);
>                                        articleName = artikelinfo[0];
>                                        isFirstAuthor = true;
>                                }
>                                // checks if author already exists
>                                IndexHits<Node> hits = authorList.get("Author", artikelinfo[2]);
>                                // if new author
>                                if (hits.size() == 0) {
>                                        if (isFirstAuthor) {
>                                                // creates author and connects him with an article
>                                                author = createAndConnectNode(artikelinfo[2], article,
>                                                                MyRelationshipTypes.WROTE, artikelinfo[1]);
>                                                isFirstAuthor = false;
>                                        } else {
>                                                author = createAndConnectNode(artikelinfo[2], article,
>                                                                MyRelationshipTypes.EDIT, artikelinfo[1]);
>                                        }
>
>                                } else {
>                                        // author already exists
>                                        if (isFirstAuthor) {
>                                                // create relationship to article
>                                                relationship = hits.getSingle().createRelationshipTo(
>                                                                article, MyRelationshipTypes.WROTE);
>                                                relationship.setProperty("Timestamp", artikelinfo[1]);
>                                                isFirstAuthor = false;
>                                        } else {
>                                                relationship = hits.getSingle().createRelationshipTo(
>                                                                article, MyRelationshipTypes.EDIT);
>                                                relationship.setProperty("Timestamp", artikelinfo[1]);
>                                        }
>
>                                }
>
>                                tx.success();
>                        }
>                } catch (Exception e) {
>                        tx.failure();
>                } finally {
>                        tx.finish();
>                }
>                db.shutdown();
>
>        }
>
>        /**
>         * creates an article and connect it with reference node
>         *
>         * @param name
>         *            Article
>         * @param reference
>         * @param relationship
>         *            Type
>         * @return Article node
>         */
>        private Node createArticle(String name, Node reference,
>                        MyRelationshipTypes relationship) {
>                Node node = db.createNode();
>                node.setProperty("Article", name);
>
>                reference.createRelationshipTo(node, relationship);
>                return node;
>        }
>
>        /**
>         * creates an author node and connects him with an article
>         *
>         * @param name
>         *            Author
>         * @param otherNode
>         *            Article
>         * @param relationshipType
>         *            Type
>         * @param timestamp
>         *            Timestamp
>         * @return new Author node
>         */
>        private Node createAndConnectNode(String name, Node otherNode,
>                        MyRelationshipTypes relationshipType, String timestamp) {
>
>                Node node = db.createNode();
>                node.setProperty("Name", name);
>                authorList.add(node, "Author", name);
>                relationship = node.createRelationshipTo(otherNode, relationshipType);
>                relationship.setProperty("Timestamp", timestamp);
>
>                return node;
>        }
>
> }
>
> Maybe you know some optimizations here, because my database is already 20GB
> big with 6M authors and 20M articles.
> The graph looks like this so far:
>
> ReferenceNode ---> Article <---- (Wrote/Edit) Author
>
>
> About the distributed system, I first have to check that at the University.
> We just thought about that, because we want to do some requests to the
> database like PageRank or Nodedegree and within a distributed system these
> requests should be faster.
>
>
> Thank you for your help again,
> Stephan
>
> --
> View this message in context:
> http://neo4j-community-discussions.438527.n3.nabble.com/How-to-create-a-graph-database-out-of-a-huge-dataset-tp3177076p3177349.html
> Sent from the Neo4J Community Discussions mailing list archive at
> Nabble.com.
> _______________________________________________
> Neo4j mailing list
> User@lists.neo4j.org
> https://lists.neo4j.org/mailman/listinfo/user
>