Re: [Neo4j] Neo4j performance with 400 million nodes

Hi Alican,

I just want to report back that I was able to reproduce the problem and narrow down the cause a bit. It seems the UI and DB threads are waiting for each other ... I haven't gotten around to fixing it yet, though.

/anders

On 2011-11-02 07:08, algecya wrote:
> Hi Anders,
>
> I appreciate your offer very much! It's good to know that the Neo4j community is very active and involved.
>
> http://neo4j-community-discussions.438527.n3.nabble.com/file/n3472966/BatchImportData.groovy
>
> Here is the import script (BatchImportData.groovy). It is a stripped-down version of the graph I used for testing. If you need more data, just increase the variable 'amountTypeA' at line 26.
>
> -- alican
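For readers without the attachment: a 1.4-era batch import of this shape typically looks like the following Java sketch. The variable name amountTypeA mirrors the Groovy script; everything else (store path, relationship type name, property layout) is an assumption for illustration, since the actual script is only available via the link above.

    import org.neo4j.graphdb.DynamicRelationshipType;
    import org.neo4j.graphdb.RelationshipType;
    import org.neo4j.helpers.collection.MapUtil;
    import org.neo4j.kernel.impl.batchinsert.BatchInserter;
    import org.neo4j.kernel.impl.batchinsert.BatchInserterImpl;

    public class BatchImportData {
        public static void main(String[] args) {
            int amountTypeA = 1; // increase this to generate more data
            // relationship type name is invented for the sketch
            RelationshipType child = DynamicRelationshipType.withName("CHILD");
            BatchInserter inserter = new BatchInserterImpl("target/graph.db");
            try {
                for (int a = 0; a < amountTypeA; a++) {
                    long nodeA = inserter.createNode(MapUtil.map("type", "A"));
                    for (int b = 0; b < 100; b++) { // 100 B nodes per A, per the node summary
                        long nodeB = inserter.createNode(MapUtil.map("type", "B"));
                        inserter.createRelationship(nodeA, nodeB, child, null);
                        // ... the same nested pattern continues down to C, D, E and F nodes
                    }
                }
            } finally {
                inserter.shutdown(); // flushes the store files to disk
            }
        }
    }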
Re: [Neo4j] Neo4j performance with 400 million nodes

Anders, thank you very much for reporting back and looking into it! Good luck fixing the bug, then.

-- alican
Re: [Neo4j] Neo4j performance with 400 million nodes

Hi Anders,

I appreciate your offer very much! It's good to know that the Neo4j community is very active and involved.

http://neo4j-community-discussions.438527.n3.nabble.com/file/n3472966/BatchImportData.groovy

Here is the import script (BatchImportData.groovy). It is a stripped-down version of the graph I used for testing. If you need more data, just increase the variable 'amountTypeA' at line 26.

-- alican
Re: [Neo4j] Neo4j performance with 400 million nodes

Hello David,

Thank you for the quick reply! I appreciate it very much.

On 2011-11-01 01:01, David Montag wrote:
> Hi Alican,
>
> On Mon, Oct 31, 2011 at 6:26 AM, algecya <alican.gecya...@openconcept.ch> wrote:
>> Hello everyone,
>>
>> We are relatively new to Neo4j and are evaluating some test scenarios in order to decide whether to use Neo4j in production systems. We used the latest stable release, 1.4.2. I wrote an import script and generated some random data with the given tree structure:
>> http://neo4j-community-discussions.438527.n3.nabble.com/file/n3467806/neo4j_nodes.png
>>
>> Node summary:
>> Type A: 1
>> Type B: 100
>> Type C: 50'000 (100 x 500)
>> Type D: 500'000 (50'000 x 10)
>> Type E: 25'000'000 (500'000 x 50)
>> Type F: 375'000'000 (25'000'000 x 15)
>>
>> This all worked quite OK; the import took approx. 30 hours using the batch importer. We have multiple indexes, but we also have one index in which all nodes are indexed. My first question would be: does it make sense to index all nodes with the same index?
>
> It depends on how you intend to access the data. If you always know the type, then it would be beneficial to use different indices. Otherwise you might want to put it all in a single index. Do remember that the index will consume some disk space as well.

OK, we decided to create a type node for each type and let the nodes relate to it (instead of having the type as an attribute on each node). I guess I was thinking too much in relational database schemas. We will therefore have one index per type.

>> If I want to list all nodes with property type:"type E", it is quite slow the first time (~270s); the second time it is fast (~0.5s). I know this is normal and most likely fixed in the current milestone version, but I am not sure how long the query will be cached in memory. Are there any configurations I should be concerned about?
>
> The difference there is all about disk access time. Will getting all 25 million E's be a common operation?

We will need to find type-E nodes with common attributes, which may return approx. 1 million results, but there will always be a search for different values. E.g., nodes of type E have an attribute "date created" and an attribute "name". I will need to find all nodes created at a given date (say year 2011) with a given name ("abc"); the second search will be date 2011 and name "def". If a certain amount of time passes and memory is used for other searches, I am afraid my first search (2011, "abc") will be evicted from memory and will take long again the next time I run it.

>> We also used the hardware sizing calculator. See the result here:
>> http://neo4j-community-discussions.438527.n3.nabble.com/file/n3467806/neo4j_hardware.png
>> Are these realistic values? I guess 128GB RAM and 12TB of SSD hard drives might be a bit cost-intensive.
>
> The reason the disk usage is 12TB is that you specified that each node on average has 10kB of data, and each relationship on average has 1kB of data. What kind of data are you storing on the nodes and relationships? These are pretty rough estimates, taking into account neither the number of properties nor their types. Also, if you decrease the property data by a factor of 100 (100B/node, 10B/rel), then your database will only consume ~150-200GB.

OK, I see your point. I think I am getting the hang of graph databases now; i.e., I might not want to put all my data into attributes but create nodes instead...

My rough guess was to increase the number of nodes to 1'000'000'000 and decrease the bytes consumed to 100B/node and 10B/rel. The result is approx. 400GB (no problem at all). But I am still a bit concerned about the 128GB RAM. Are there any reference applications with this number of nodes and relationships?

> We are in the process of adding case studies. Please get in touch with sales for more info at this time.

Thank you, I will do so.

>> Also, Neoclipse won't start/connect to the database anymore with this amount of data. Am I missing some configuration for Neoclipse?
>
> Are you getting an error message?

No error messages. Is there an option to enable logging? I let Neoclipse run for almost an hour and suddenly the graph appeared, but I cannot navigate (it's as if it were frozen, yet there are calculations going on). I'm not sure why it takes so long, though: the initial traversal depth is 1, and there are 16 nodes and 15 relationships. I also decreased the number of nodes to be displayed to 50. I thought it would load data lazily?

Best regards,
alican
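The per-type index plus date/name lookups described above map directly onto the 1.4 embedded API with a compound Lucene query. A minimal sketch, assuming an index named "E" and property keys "created" and "name" (the actual names in alican's dataset are not shown in the thread):

    import org.neo4j.graphdb.GraphDatabaseService;
    import org.neo4j.graphdb.Node;
    import org.neo4j.graphdb.index.Index;
    import org.neo4j.graphdb.index.IndexHits;
    import org.neo4j.kernel.EmbeddedGraphDatabase;

    public class QueryTypeE {
        public static void main(String[] args) {
            GraphDatabaseService db = new EmbeddedGraphDatabase("target/graph.db");
            Index<Node> typeE = db.index().forNodes("E");
            // Lucene query syntax: both terms must match
            IndexHits<Node> hits = typeE.query("created:2011 AND name:abc");
            try {
                for (Node n : hits) {
                    // process the ~1M matching E nodes
                }
            } finally {
                hits.close(); // always release the index hits
                db.shutdown();
            }
        }
    }

Whether the second run of such a query is fast depends on the OS file-system cache and Neo4j's caches being warm, not on a per-query result cache, which is why the speedup fades as other data displaces it.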
Re: [Neo4j] Neo4j performance with 400 million nodes

Hi!

>>> Also, Neoclipse won't start/connect to the database anymore with this amount of data. Am I missing some configuration for Neoclipse?
>> Are you getting an error message?
> No error messages. Is there an option to enable logging? I let Neoclipse run for almost an hour and suddenly the graph appeared, but I cannot navigate (it's as if it were frozen, yet there are calculations going on). I'm not sure why it takes so long, though: the initial traversal depth is 1, and there are 16 nodes and 15 relationships. I also decreased the number of nodes to be displayed to 50. I thought it would load data lazily?

If you start Neoclipse from the command line, you may see some extra output there. Also, inside the Neoclipse directory there's a workspace directory, and inside that you'll find .metadata/.log

Just a thought: how many relationship types are there?

/anders
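The relationship-type question can also be answered directly against the store with the embedded 1.4 API. A minimal sketch (the store path is an assumption):

    import org.neo4j.graphdb.GraphDatabaseService;
    import org.neo4j.graphdb.RelationshipType;
    import org.neo4j.kernel.EmbeddedGraphDatabase;

    public class ListRelationshipTypes {
        public static void main(String[] args) {
            GraphDatabaseService db = new EmbeddedGraphDatabase("target/graph.db");
            try {
                // iterates every relationship type registered in the store
                for (RelationshipType t : db.getRelationshipTypes()) {
                    System.out.println(t.name());
                }
            } finally {
                db.shutdown();
            }
        }
    }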
Re: [Neo4j] Neo4j performance with 400 million nodes

Hi Alican,

> But I am still a bit concerned about the 128GB RAM.

You can run it on less, of course. You could run it on your laptop and it would still work.

However, Neo4j is clever in its use of RAM. The more RAM you can allocate to Neo4j, the better the chance that database reads come straight from memory rather than spending potentially milliseconds going to mechanical disk, which caps you at thousands of traversals per second instead of millions. So more RAM = fewer disk hits (statistically), which is where you'll get huge read performance benefits; less RAM means more likelihood of going to disk.

All things being equal, with 128GB RAM you can cache a lot of your dataset in main memory, perhaps even all of your *active* dataset (since it's about a quarter the size of your full dataset). That's going to give you blistering performance.

Jim
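How much of each store file is kept in memory is controlled by the memory-mapped buffer settings. A minimal sketch for the 1.4 embedded API (the same keys go into conf/neo4j.properties for the server); the sizes below are illustrative only and should be tuned against the actual store file sizes on disk:

    import java.util.Map;
    import org.neo4j.graphdb.GraphDatabaseService;
    import org.neo4j.helpers.collection.MapUtil;
    import org.neo4j.kernel.EmbeddedGraphDatabase;

    public class TunedStartup {
        public static void main(String[] args) {
            // map as much of each store file into memory as you can afford
            Map<String, String> config = MapUtil.stringMap(
                "neostore.nodestore.db.mapped_memory", "2G",
                "neostore.relationshipstore.db.mapped_memory", "8G",
                "neostore.propertystore.db.mapped_memory", "4G",
                "neostore.propertystore.db.strings.mapped_memory", "2G");
            GraphDatabaseService db = new EmbeddedGraphDatabase("target/graph.db", config);
            db.shutdown();
        }
    }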
Re: [Neo4j] Neo4j performance with 400 million nodes

Hey Anders!

Thanks for the pointers.

On 2011-11-01 09:49, Anders Nawroth wrote:
> If you start Neoclipse from the command line you may see some extra output there. Also, inside the Neoclipse directory there's a workspace directory, and inside that you'll find .metadata/.log

There is no such directory (I am using Neo4j 1.4.2, and I am aware that it is supposed to be a hidden directory). There was no extra output on the shell, but after another hour I got a Java out-of-memory (heap space) exception, which was my first guess anyway. I just don't see why, though, since it is supposed to load only the first 16 nodes.

> Just a thought: how many relationship types are there?

All relationship types are in the graph already: 6 relationship types in all.

-- alican
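If the exception is thrown in Neoclipse's own JVM, raising its heap limit is the first thing to try. Assuming this 1.4-era Neoclipse build uses the standard Eclipse-style launcher with a neoclipse.ini file next to the executable (an assumption; the values are illustrative), the relevant lines would look like:

    -vmargs
    -Xms512m
    -Xmx2048m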
Re: [Neo4j] Neo4j performance with 400 million nodes

Hey Jim!

Thanks for the thoughts. I know I could run it on less RAM; it's not a matter of can or cannot, and I am also aware that more RAM means better performance. My question is more: how will it perform with less RAM, say 32GB? Every system is quite fast with 128GB of RAM. I am not sure we can convince our customer to invest in 3x 128GB RAM (production system, staging system, test system), especially since there is not yet any reference application that would guarantee acceptable performance on this kind of system.

On 2011-11-01 10:27, Jim Webber wrote:
> All things being equal, with 128GB RAM you can cache a lot of your dataset in main memory, perhaps even all of your *active* dataset (since it's about a quarter the size of your full dataset). That's going to give you blistering performance.

-- alican
Re: [Neo4j] Neo4j performance with 400 million nodes

Hi!

>> If you start Neoclipse from the command line you may see some extra output there. Also, inside the Neoclipse directory there's a workspace directory, and inside that you'll find .metadata/.log
>
> There is no such directory (I am using Neo4j 1.4.2, and I am aware that it is supposed to be a hidden directory).

OK, then the directory was created in the current directory when you started Neoclipse the first time; it's probably named just "workspace". Later versions will always put it in the Neoclipse dir (fixed after the 1.4.x cycle, apparently).

> There was no extra output on the shell, but after another hour I got a Java out-of-memory (heap space) exception, which was my first guess anyway. I just don't see why, though, since it is supposed to load only the first 16 nodes. All relationship types are in the graph already: 6 relationship types in all.

Seems like you hit a bug, then. If you send me code to generate a graph like yours, I'll try it out.

/anders
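(Alican's answer, with the actual Groovy generator, appears earlier in this archive.) For the shape of such a repro in plain Java against the 1.4 embedded API, a minimal sketch producing exactly the 16 nodes and 15 relationships the initial Neoclipse view would show; the relationship-type and property names are invented:

    import org.neo4j.graphdb.DynamicRelationshipType;
    import org.neo4j.graphdb.GraphDatabaseService;
    import org.neo4j.graphdb.Node;
    import org.neo4j.graphdb.Transaction;
    import org.neo4j.kernel.EmbeddedGraphDatabase;

    public class GenerateReproGraph {
        public static void main(String[] args) {
            GraphDatabaseService db = new EmbeddedGraphDatabase("target/repro.db");
            Transaction tx = db.beginTx();
            try {
                Node root = db.createNode();
                root.setProperty("type", "A");
                for (int i = 0; i < 15; i++) { // 15 children: 16 nodes, 15 rels
                    Node child = db.createNode();
                    child.setProperty("type", "B");
                    root.createRelationshipTo(child,
                            DynamicRelationshipType.withName("HAS_B"));
                }
                tx.success();
            } finally {
                tx.finish(); // 1.4-era idiom; later versions use tx.close()
                db.shutdown();
            }
        }
    }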
Re: [Neo4j] Neo4j performance with 400 million nodes

Alican,

we have other customers with RAM sizes like that. It is always about the size of your hot (i.e. cached) dataset: the better you understand your use cases, the better you can estimate the number of nodes and relationships (and their properties) that have to be cached.

Michael

On 2011-11-01, at 10:53, Alican Gecyasar wrote:
> My question is more: how will it perform with less RAM, say 32GB? [...] I am not sure we can convince our customer to invest in 3x 128GB RAM (production system, staging system, test system).
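As a back-of-envelope example of that kind of estimate, using the rough 1.4-era store record sizes (about 9 bytes per node record and 33 bytes per relationship record) together with alican's assumed 100B of property data per node: keeping all 25 million type-E nodes hot costs about 25'000'000 x (9B + 100B) ~ 2.7GB of store data, plus their 25 million incoming relationships at 33B each, ~ 0.8GB. That covers the memory-mapped store files only; the object caches add further per-node overhead on top. Either way the point stands: if types A through E are the hot set, they fit in far less than 128GB, and it is the 375 million type-F nodes that drive the totals up.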
[Neo4j] Neo4j performance with 400 million nodes

Hello everyone,

We are relatively new to Neo4j and are evaluating some test scenarios in order to decide whether to use Neo4j in production systems. We used the latest stable release, 1.4.2. I wrote an import script and generated some random data with the given tree structure:
http://neo4j-community-discussions.438527.n3.nabble.com/file/n3467806/neo4j_nodes.png

Node summary:
Type A: 1
Type B: 100
Type C: 50'000 (100 x 500)
Type D: 500'000 (50'000 x 10)
Type E: 25'000'000 (500'000 x 50)
Type F: 375'000'000 (25'000'000 x 15)

This all worked quite OK; the import took approx. 30 hours using the batch importer. We have multiple indexes, but we also have one index in which all nodes are indexed. My first question would be: does it make sense to index all nodes with the same index?

If I want to list all nodes with property type:"type E", it is quite slow the first time (~270s); the second time it is fast (~0.5s). I know this is normal and most likely fixed in the current milestone version, but I am not sure how long the query will be cached in memory. Are there any configurations I should be concerned about?

We also used the hardware sizing calculator. See the result here:
http://neo4j-community-discussions.438527.n3.nabble.com/file/n3467806/neo4j_hardware.png
Are these realistic values? I guess 128GB RAM and 12TB of SSD hard drives might be a bit cost-intensive. Are there any reference applications with this number of nodes and relationships?

Also, Neoclipse won't start/connect to the database anymore with this amount of data. Am I missing some configuration for Neoclipse?

Best regards,
-- alican
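On the cold-first-query effect described above: a common mitigation is to warm the caches once at startup by touching the stores, so later queries hit memory. A minimal sketch against the 1.4 embedded API (the store path is an assumption; with 400 million nodes a full scan is itself a long operation, so in practice one would warm only the hot subset):

    import org.neo4j.graphdb.GraphDatabaseService;
    import org.neo4j.graphdb.Node;
    import org.neo4j.graphdb.Relationship;
    import org.neo4j.kernel.EmbeddedGraphDatabase;

    public class WarmUp {
        public static void main(String[] args) {
            GraphDatabaseService db = new EmbeddedGraphDatabase("target/graph.db");
            try {
                long nodes = 0;
                for (Node n : db.getAllNodes()) {
                    // touching each relationship pulls its record into cache
                    for (Relationship r : n.getRelationships()) {
                        r.getId();
                    }
                    nodes++;
                }
                System.out.println("warmed " + nodes + " nodes");
            } finally {
                db.shutdown();
            }
        }
    }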