Stephan,
can you perhaps share your CSV file, or at least give a few sample lines and a
typical distribution (articles per author etc.)? You tested this with 20M
articles and 6M authors? What is the current runtime of that import, and on
what kind of hardware? When working on a similar test I noticed that a lot of
time was spent in parsing, so it might be sensible to write a small converter
that turns the CSV file into a file with one entry per line and at the same
time filters out invalid entries (writing them to a separate file). If I read
your code correctly, you also have a number of authors per article on the
following lines, duplicating the article name just with another author? That
might also be handled by the separate file, which could contain the number of
authors or have a blank line between each new article.
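A minimal sketch of what I mean by such a converter (the `%;% ` separator is
taken from your code; the class and method names, and the blank-line-between-
articles convention, are just one possible choice):

```java
import java.io.*;

public class CsvConverter {
    // Splits the raw CSV into a clean file (one "article;timestamp;author"
    // entry per line, with a blank line between articles) and a separate
    // reject file for rows that don't have exactly three fields.
    static void convert(BufferedReader in, Writer valid, Writer invalid)
            throws IOException {
        String line;
        String lastArticle = null;
        while ((line = in.readLine()) != null) {
            String[] parts = line.split("%;% ");
            if (parts.length != 3) {
                // malformed row -> reject file, import never sees it
                invalid.write(line + "\n");
                continue;
            }
            if (lastArticle != null && !lastArticle.equals(parts[0])) {
                valid.write("\n"); // blank line marks the start of a new article
            }
            lastArticle = parts[0];
            valid.write(parts[0] + ";" + parts[1] + ";" + parts[2] + "\n");
        }
    }

    public static void main(String[] args) throws IOException {
        String raw = "A%;% t1%;% alice\nA%;% t2%;% bob\n"
                + "broken line\nB%;% t3%;% carol\n";
        StringWriter valid = new StringWriter();
        StringWriter invalid = new StringWriter();
        convert(new BufferedReader(new StringReader(raw)), valid, invalid);
        System.out.print(valid);  // clean entries, grouped by article
        System.err.print(invalid); // rejected rows
    }
}
```

The import job then only has to split on a single `;` and can treat a blank
line as "next article", which also removes the need to compare article names
on every row.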
I don't know whether your input data is presorted in any way. If recurring
authors show up within a certain window, you might also enable caching of
author index lookups:
((LuceneIndex)authorList).setCacheCapacity("Author",10000);
You just specify your Java heap memory with -Xmx3G or similar.
For your memory-mapped files it would be good to know how much memory your
machine has. Otherwise, see here:
(http://docs.neo4j.org/chunked/1.4/configuration.html)
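For reference, the memory-map settings go into the config you pass to the
database (property names as in the 1.x config reference linked above; the
values below are only placeholders — they should be sized to your actual
store files and available RAM):

```properties
# rough example for a machine with plenty of RAM; tune to your store sizes
neostore.nodestore.db.mapped_memory=500M
neostore.relationshipstore.db.mapped_memory=3G
neostore.propertystore.db.mapped_memory=1G
neostore.propertystore.db.strings.mapped_memory=500M
```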
Something that might hurt performance is creating the super-category nodes for
articles and authors, which then have 20M+ relationships. Perhaps it would be
better to split them by some sharding key into subnodes, or to use an index
for those too. It might also be better to reduce your tx size a bit, to 10k.
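Picking such a sharding key can be as simple as hashing the name into a fixed
number of buckets (purely illustrative — the class name and bucket count of 64
are arbitrary):

```java
public class ShardKey {
    // Maps an article/author name to one of bucketCount sub-category nodes,
    // so no single super-node accumulates 20M+ relationships.
    static int bucketFor(String name, int bucketCount) {
        // mask the sign bit: Math.abs(hashCode()) alone would break on
        // Integer.MIN_VALUE
        return (name.hashCode() & 0x7fffffff) % bucketCount;
    }

    public static void main(String[] args) {
        System.out.println(bucketFor("Alan Turing", 64));
    }
}
```

Each bucket would be one subnode hanging off the category node; lookups stay
deterministic because the same name always maps to the same bucket.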
Something I noticed: you do the index get, check the size, and call
getSingle() afterwards. I'm not sure whether leaving the result iterator
unclosed in the other case causes any problems, so probably change that to
Node author = authorList.get("Author", artikelinfo[2]).getSingle();
which automatically closes the result and returns null if there is no hit.
Regarding database size: what is the size and format of the timestamp string?
It might be sensible to convert it to a long before storing it in the graph,
or at least to make sure it fits into the short-string compression:
(http://docs.neo4j.org/chunked/1.4/short-strings.html). It depends on how you
will use it in traversals.
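If the timestamps are ISO-like (an assumption on my part — I don't know your
actual format, which is why I asked), converting to epoch milliseconds could
look like this:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.TimeZone;

public class TimestampConverter {
    // Parses an ISO-style timestamp (assumed format!) into epoch
    // milliseconds, so it can be stored as a primitive long property
    // instead of a string.
    static long toEpochMillis(String ts) throws ParseException {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
        // interpret the trailing 'Z' as UTC rather than the local time zone
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        return fmt.parse(ts).getTime();
    }

    public static void main(String[] args) throws ParseException {
        System.out.println(toEpochMillis("1970-01-01T00:00:01Z")); // prints 1000
    }
}
```

A long is both smaller on disk than the string and directly comparable in
traversals, e.g. for "edits after date X" filters.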
I'll take your code and run it locally to see where the performance issues
are.
Cheers
Michael
On 17.07.2011, at 20:53, st3ven wrote:
> Hi,
>
> thanks for your fast answer.
> Right now I'm using lucene for 6M authors, but my whole dataset consists of
> nearly 25M authors.
> Can I use Lucene there as well? Because I think checking whether a user
> already exists is getting really slow.
> How can I change my heap memory settings and my memory-map settings, since
> I'm using the transactional mode?
> Because I think with 25M authors I will get an OutOfMemoryError.
>
> Here is my code that I have already written so far:
>
> import java.io.BufferedReader;
> import java.io.FileReader;
> import java.io.IOException;
>
> import org.neo4j.graphdb.GraphDatabaseService;
> import org.neo4j.graphdb.Node;
> import org.neo4j.graphdb.Relationship;
> import org.neo4j.graphdb.Transaction;
> import org.neo4j.graphdb.index.Index;
> import org.neo4j.graphdb.index.IndexHits;
> import org.neo4j.graphdb.index.IndexManager;
> import org.neo4j.kernel.EmbeddedGraphDatabase;
>
> public class WikiGraphRegUser {
>
>     /**
>      * @param args
>      */
>     public static void main(String[] args) throws IOException {
>         BufferedReader bf = new BufferedReader(new FileReader("E:/wiki0.csv"));
>         WikiGraphRegUser wgru = new WikiGraphRegUser();
>         wgru.createGraphDatabase(bf);
>     }
>
>     private String articleName = "";
>     private GraphDatabaseService db;
>     private IndexManager index;
>     private Index<Node> authorList;
>     private int transactionCounter = 0;
>     private Node article;
>     private boolean isFirstAuthor = false;
>     private Node author;
>     private Relationship relationship;
>     private int node;
>
>     private void createGraphDatabase(BufferedReader bf) {
>         db = new EmbeddedGraphDatabase("target/db");
>         index = db.index();
>         authorList = index.forNodes("Author");
>
>         String zeile;
>         Transaction tx = db.beginTx();
>
>         try {
>             // reads lines of the CSV file
>             while ((zeile = bf.readLine()) != null) {
>                 if (transactionCounter++ % 50000 == 0) {
>                     tx.success();
>                     tx.finish();
>                     tx = db.beginTx();
>                 }
>                 // String[] looks like this: Article%;% Timestamp%;% Author
>                 String[] artikelinfo = zeile.split("%;% ");
>                 if (artikelinfo.length != 3) {
>                     System.out.println("ERROR: check CSV");
>                     for (int i = 0; i < artikelinfo.length; i++) {
>                         System.out.println(artikelinfo[i]);
>                     }
>                     return;
>                 }
>
>                 if (articleName == "") {
>                     // create Article and connect it with the ReferenceNode
>                     article = createArticle(artikelinfo[0],
>                             db.getReferenceNode(),
>                             MyRelationshipTypes.ARTICLE);
>                     articleName = artikelinfo[0];
>                     isFirstAuthor = true;
>                 } else if (!articleName.equals(artikelinfo[0])) {
>                     // create Article and connect it with the ReferenceNode
>                     article = createArticle(artikelinfo[0],
>                             db.getReferenceNode(),
>                             MyRelationshipTypes.ARTICLE);
>                     articleName = artikelinfo[0];
>                     isFirstAuthor = true;
>                 }
>                 // checks whether the author already exists
>                 IndexHits<Node> hits = authorList.get("Author", artikelinfo[2]);
>                 // if new author
>                 if (hits.size() == 0) {
>                     if (isFirstAuthor) {
>                         // creates the author and connects him with an article
>                         author = createAndConnectNode(artikelinfo[2], article,
>                                 MyRelationshipTypes.WROTE, artikelinfo[1]);
>                         isFirstAuthor = false;
>                     } else {
>                         author = createAndConnectNode(artikelinfo[2], article,
>                                 MyRelationshipTypes.EDIT, artikelinfo[1]);
>                     }
>                 } else {
>                     // author already exists
>                     if (isFirstAuthor) {
>                         // create relationship to the article
>                         relationship = hits.getSingle().createRelationshipTo(
>                                 article, MyRelationshipTypes.WROTE);
>                         relationship.setProperty("Timestamp", artikelinfo[1]);
>                         isFirstAuthor = false;
>                     } else {
>                         relationship = hits.getSingle().createRelationshipTo(
>                                 article, MyRelationshipTypes.EDIT);
>                         relationship.setProperty("Timestamp", artikelinfo[1]);
>                     }
>                 }
>
>                 tx.success();
>             }
>         } catch (Exception e) {
>             tx.failure();
>         } finally {
>             tx.finish();
>         }
>         db.shutdown();
>     }
>
>     /**
>      * creates an article and connects it with the reference node
>      *
>      * @param name Article
>      * @param reference ReferenceNode
>      * @param relationship Type
>      * @return Article node
>      */
>     private Node createArticle(String name, Node reference,
>             MyRelationshipTypes relationship) {
>         Node node = db.createNode();
>         node.setProperty("Article", name);
>         reference.createRelationshipTo(node, relationship);
>         return node;
>     }
>
>     /**
>      * creates an author node and connects him with an article
>      *
>      * @param name Author
>      * @param otherNode Article
>      * @param relationshipType Type
>      * @param timestamp Timestamp
>      * @return new Author node
>      */
>     private Node createAndConnectNode(String name, Node otherNode,
>             MyRelationshipTypes relationshipType, String timestamp) {
>         Node node = db.createNode();
>         node.setProperty("Name", name);
>         authorList.add(node, "Author", name);
>         relationship = node.createRelationshipTo(otherNode, relationshipType);
>         relationship.setProperty("Timestamp", timestamp);
>         return node;
>     }
> }
>
> Maybe you know some optimizations here, because my database is already 20GB
> in size with 6M authors and 20M articles.
> The graph looks like this so far:
>
> ReferenceNode ---> Article <---- (Wrote/Edit) Author
>
>
> About the distributed system, I first have to check that at the university.
> We just thought about it because we want to run some queries against the
> database, like PageRank or node degree, and on a distributed system these
> queries should be faster.
>
>
> Thank you for your help again,
> Stephan
>
> --
> View this message in context:
> http://neo4j-community-discussions.438527.n3.nabble.com/How-to-create-a-graph-database-out-of-a-huge-dataset-tp3177076p3177349.html
> Sent from the Neo4J Community Discussions mailing list archive at Nabble.com.
_______________________________________________
Neo4j mailing list
[email protected]
https://lists.neo4j.org/mailman/listinfo/user