Stephan,
can you perhaps share your CSV file, or at least give a few sample lines and a
typical distribution (articles per author etc.)? You tested this with 20M
articles and 6M authors? What is the current runtime of that import, and on
what kind of hardware? When working on a similar test I noticed that a lot of
time was spent in parsing, so it might be sensible to write a small converter
that turns the CSV file into a file with one entry per line and at the same
time filters out invalid entries (writing them to a separate file). If I read
your code correctly, you also have a number of authors per article on the
following lines, duplicating the article name just with another author? That
might also be handled by the separate file, which could contain the number of
authors or have a blank line between each new article.
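A minimal sketch of what I mean by such a converter (the `%;% ` separator is
taken from your code; the class and method names, and the blank-line-between-
articles convention, are just one possible choice):

```java
import java.io.*;

public class CsvConverter {
    // Splits the raw CSV into a clean file (one "article;timestamp;author"
    // entry per line, with a blank line between articles) and a separate
    // reject file for rows that don't have exactly three fields.
    static void convert(BufferedReader in, Writer valid, Writer invalid)
            throws IOException {
        String line;
        String lastArticle = null;
        while ((line = in.readLine()) != null) {
            String[] parts = line.split("%;% ");
            if (parts.length != 3) {
                // malformed row -> reject file, import never sees it
                invalid.write(line + "\n");
                continue;
            }
            if (lastArticle != null && !lastArticle.equals(parts[0])) {
                valid.write("\n"); // blank line marks the start of a new article
            }
            lastArticle = parts[0];
            valid.write(parts[0] + ";" + parts[1] + ";" + parts[2] + "\n");
        }
    }

    public static void main(String[] args) throws IOException {
        String raw = "A%;% t1%;% alice\nA%;% t2%;% bob\n"
                + "broken line\nB%;% t3%;% carol\n";
        StringWriter valid = new StringWriter();
        StringWriter invalid = new StringWriter();
        convert(new BufferedReader(new StringReader(raw)), valid, invalid);
        System.out.print(valid);  // clean entries, grouped by article
        System.err.print(invalid); // rejected rows
    }
}
```

The import job then only has to split on a single `;` and can treat a blank
line as "next article", which also removes the need to compare article names
on every row.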
I don't know whether your input data is presorted in any way. If recurring
authors show up within a certain window, you might also enable caching of
author index lookups:
((LuceneIndex)authorList).setCacheCapacity("Author",10000);
You just specify your Java heap memory with -Xmx3G or similar.
For your memory-mapped files it would be good to know how much memory your
machine has. Otherwise, see here:
(http://docs.neo4j.org/chunked/1.4/configuration.html)
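For reference, the memory-map settings go into the config you pass to the
database (property names as in the 1.x config reference linked above; the
values below are only placeholders — they should be sized to your actual
store files and available RAM):

```properties
# rough example for a machine with plenty of RAM; tune to your store sizes
neostore.nodestore.db.mapped_memory=500M
neostore.relationshipstore.db.mapped_memory=3G
neostore.propertystore.db.mapped_memory=1G
neostore.propertystore.db.strings.mapped_memory=500M
```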
Something that might hurt performance is creating the super-category nodes for
articles and authors, which then have 20M+ relationships. Perhaps it would be
better to split them by some sharding key into subnodes, or to use an index
for those too. It might also be better to reduce your tx size a bit, to 10k.
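Picking such a sharding key can be as simple as hashing the name into a fixed
number of buckets (purely illustrative — the class name and bucket count of 64
are arbitrary):

```java
public class ShardKey {
    // Maps an article/author name to one of bucketCount sub-category nodes,
    // so no single super-node accumulates 20M+ relationships.
    static int bucketFor(String name, int bucketCount) {
        // mask the sign bit: Math.abs(hashCode()) alone would break on
        // Integer.MIN_VALUE
        return (name.hashCode() & 0x7fffffff) % bucketCount;
    }

    public static void main(String[] args) {
        System.out.println(bucketFor("Alan Turing", 64));
    }
}
```

Each bucket would be one subnode hanging off the category node; lookups stay
deterministic because the same name always maps to the same bucket.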
Something I noticed: you do the index get, check the size, and call
getSingle() afterwards. I'm not sure whether leaving the result iterator
unclosed in the other case causes any problems, so probably change that to
Node author = authorList.get("Author", artikelinfo[2]).getSingle();
which automatically closes the result and returns null if there is no hit.
Regarding database size: what is the size and format of the timestamp string?
It might be sensible to convert it to a long before storing it in the graph,
or at least to make sure it fits into the short-string compression:
(http://docs.neo4j.org/chunked/1.4/short-strings.html). It depends on how you
will use it in traversals.
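If the timestamps are ISO-like (an assumption on my part — I don't know your
actual format, which is why I asked), converting to epoch milliseconds could
look like this:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.TimeZone;

public class TimestampConverter {
    // Parses an ISO-style timestamp (assumed format!) into epoch
    // milliseconds, so it can be stored as a primitive long property
    // instead of a string.
    static long toEpochMillis(String ts) throws ParseException {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
        // interpret the trailing 'Z' as UTC rather than the local time zone
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        return fmt.parse(ts).getTime();
    }

    public static void main(String[] args) throws ParseException {
        System.out.println(toEpochMillis("1970-01-01T00:00:01Z")); // prints 1000
    }
}
```

A long is both smaller on disk than the string and directly comparable in
traversals, e.g. for "edits after date X" filters.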
I'll take your code and run it locally to see where the performance issues
are.
Cheers
Michael
On 17.07.2011, at 20:53, st3ven wrote:
> Hi,
>
> thanks for your fast answer.
> Right now I'm using lucene for 6M authors, but my whole dataset consists of
> nearly 25M authors.
> Can I use Lucene there as well? Because I think checking whether a user
> already exists is getting really slow.
> How can I change my heap memory settings and my memory-map settings, since
> I'm using the transactional mode?
> Because I think with 25M authors I will get an OutOfMemoryError.
>
> Here is my code that I have already written so far:
>
> import java.io.BufferedReader;
> import java.io.FileReader;
> import java.io.IOException;
>
> import org.neo4j.graphdb.GraphDatabaseService;
> import org.neo4j.graphdb.Node;
> import org.neo4j.graphdb.Relationship;
> import org.neo4j.graphdb.Transaction;
> import org.neo4j.graphdb.index.Index;
> import org.neo4j.graphdb.index.IndexHits;
> import org.neo4j.graphdb.index.IndexManager;
> import org.neo4j.kernel.EmbeddedGraphDatabase;
>
> public class WikiGraphRegUser {
>
>     /**
>      * @param args
>      */
>     public static void main(String[] args) throws IOException {
>         BufferedReader bf = new BufferedReader(new FileReader("E:/wiki0.csv"));
>         WikiGraphRegUser wgru = new WikiGraphRegUser();
>         wgru.createGraphDatabase(bf);
>     }
>
>     private String articleName = "";
>     private GraphDatabaseService db;
>     private IndexManager index;
>     private Index<Node> authorList;
>     private int transactionCounter = 0;
>     private Node article;
>     private boolean isFirstAuthor = false;
>     private Node author;
>     private Relationship relationship;
>     private int node;
>
>     private void createGraphDatabase(BufferedReader bf) {
>         db = new EmbeddedGraphDatabase("target/db");
>         index = db.index();
>         authorList = index.forNodes("Author");
>
>         String zeile;
>         Transaction tx = db.beginTx();
>
>         try {
>             // reads lines of the CSV file
>             while ((zeile = bf.readLine()) != null) {
>                 if (transactionCounter++ % 50000 == 0) {
>                     tx.success();
>                     tx.finish();
>                     tx = db.beginTx();
>                 }
>                 // String[] looks like this: Article%;% Timestamp%;% Author
>                 String[] artikelinfo = zeile.split("%;% ");
>                 if (artikelinfo.length != 3) {
>                     System.out.println("ERROR: check CSV");
>                     for (int i = 0; i < artikelinfo.length; i++) {
>                         System.out.println(artikelinfo[i]);
>                     }
>                     return;
>                 }
>
>                 if (articleName == "") {
>                     // create Article and connect it with the ReferenceNode
>                     article = createArticle(artikelinfo[0],
>                             db.getReferenceNode(),
>                             MyRelationshipTypes.ARTICLE);
>                     articleName = artikelinfo[0];
>                     isFirstAuthor = true;
>                 } else if (!articleName.equals(artikelinfo[0])) {
>                     // create Article and connect it with the ReferenceNode
>                     article = createArticle(artikelinfo[0],
>                             db.getReferenceNode(),
>                             MyRelationshipTypes.ARTICLE);
>                     articleName = artikelinfo[0];
>                     isFirstAuthor = true;
>                 }
>                 // checks whether the author already exists
>                 IndexHits<Node> hits = authorList.get("Author", artikelinfo[2]);
>                 // if new author
>                 if (hits.size() == 0) {
>                     if (isFirstAuthor) {
>                         // creates the author and connects him with an article
>                         author = createAndConnectNode(artikelinfo[2], article,
>                                 MyRelationshipTypes.WROTE, artikelinfo[1]);
>                         isFirstAuthor = false;
>                     } else {
>                         author = createAndConnectNode(artikelinfo[2], article,
>                                 MyRelationshipTypes.EDIT, artikelinfo[1]);
>                     }
>                 } else {
>                     // author already exists
>                     if (isFirstAuthor) {
>                         // create relationship to the article
>                         relationship = hits.getSingle().createRelationshipTo(
>                                 article, MyRelationshipTypes.WROTE);
>                         relationship.setProperty("Timestamp", artikelinfo[1]);
>                         isFirstAuthor = false;
>                     } else {
>                         relationship = hits.getSingle().createRelationshipTo(
>                                 article, MyRelationshipTypes.EDIT);
>                         relationship.setProperty("Timestamp", artikelinfo[1]);
>                     }
>                 }
>
>                 tx.success();
>             }
>         } catch (Exception e) {
>             tx.failure();
>         } finally {
>             tx.finish();
>         }
>         db.shutdown();
>     }
>
>     /**
>      * creates an article and connects it with the reference node
>      *
>      * @param name Article
>      * @param reference ReferenceNode
>      * @param relationship Type
>      * @return Article node
>      */
>     private Node createArticle(String name, Node reference,
>             MyRelationshipTypes relationship) {
>         Node node = db.createNode();
>         node.setProperty("Article", name);
>         reference.createRelationshipTo(node, relationship);
>         return node;
>     }
>
>     /**
>      * creates an author node and connects him with an article
>      *
>      * @param name Author
>      * @param otherNode Article
>      * @param relationshipType Type
>      * @param timestamp Timestamp
>      * @return new Author node
>      */
>     private Node createAndConnectNode(String name, Node otherNode,
>             MyRelationshipTypes relationshipType, String timestamp) {
>         Node node = db.createNode();
>         node.setProperty("Name", name);
>         authorList.add(node, "Author", name);
>         relationship = node.createRelationshipTo(otherNode, relationshipType);
>         relationship.setProperty("Timestamp", timestamp);
>         return node;
>     }
> }
>
> Maybe you know some optimizations here, because my database is already 20GB
> in size with 6M authors and 20M articles.
> The graph looks like this so far:
>
> ReferenceNode ---> Article <---- (Wrote/Edit) Author
>
>
> About the distributed system, I first have to check that at the university.
> We just thought about it because we want to run some queries against the
> database, like PageRank or node degree, and on a distributed system these
> queries should be faster.
>
>
> Thank you for your help again,
> Stephan
>
> --
> View this message in context:
> http://neo4j-community-discussions.438527.n3.nabble.com/How-to-create-a-graph-database-out-of-a-huge-dataset-tp3177076p3177349.html
> Sent from the Neo4J Community Discussions mailing list archive at Nabble.com.
_______________________________________________
Neo4j mailing list
[email protected]
https://lists.neo4j.org/mailman/listinfo/user