Hi List,

I'm currently investigating how we can improve the storage of our website's pageviews. The model described in the wiki seems to be a bit too simple for our case, but it is basically how I'd like to do it.
We serve about 1 billion pageviews yearly, so that's already 3 billion primitives for the pageviews alone. The visits themselves (paths in the example) will be another 100 million nodes or so, plus various relations and/or properties.

To efficiently retrieve aggregated summaries from that dataset (e.g. which article got the most views on date D) I probably need to introduce summaries of the data (article X got N pageviews on 2010/7/24) and/or partition the pageviews per date period. Those summaries probably need to be linked to some "date-tree" (e.g. year -> month -> day) for efficient querying when no specific articles are requested.

Unfortunately, on a normal day we get pageviews to about 70k different articles/items, so such per-article daily partition nodes would add another 25 million nodes per year with 2 relations each. Hourly summary nodes would of course add even more. All in all, there'll probably be some 1.2 billion nodes and 3.5 billion relations for a single year (with properties mostly on the less frequent items).

As there are no hard performance requirements (a few seconds or even minutes to answer a query is acceptable), Neo4j seems up to the task of storing a few years' worth of data on a single machine, although the server(s) storing it would probably still require quite a bit of memory and preferably SSD storage.

I'm wondering how much on-disk storage such a set-up would roughly require. Are the sizes per node (9 bytes) and per relationship (33 bytes) from the configuration page in the wiki valid for this much data as well? In that case it would be about 10GB for the nodes and 110GB for the relationships per year? And how much storage would be required for properties, for instance for a short, int or long?
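For reference, the arithmetic behind those estimates can be written as a small Python sketch. The record sizes are the wiki figures quoted above, the node and relationship counts are my own rough yearly totals, and properties and indexes are left out entirely, so treat the result as a lower bound rather than a real sizing.

```python
# Back-of-the-envelope store sizes for one year of data.
# Assumption: fixed-size records, using the per-record sizes quoted from the wiki.
NODE_BYTES = 9            # bytes per node record
RELATIONSHIP_BYTES = 33   # bytes per relationship record

# Rough yearly totals from the estimate above (not measured values).
nodes_per_year = 1_200_000_000
relationships_per_year = 3_500_000_000


def store_size_gb(records: int, bytes_per_record: int) -> float:
    """Return the size in decimal gigabytes (10**9 bytes), excluding properties."""
    return records * bytes_per_record / 1e9


node_gb = store_size_gb(nodes_per_year, NODE_BYTES)
relationship_gb = store_size_gb(relationships_per_year, RELATIONSHIP_BYTES)

# Per-article daily summary nodes: about 70k distinct articles on a normal day.
daily_summary_nodes = 70_000 * 365

print(f"nodes:         {node_gb:.1f} GB")
print(f"relationships: {relationship_gb:.1f} GB")
print(f"daily summary nodes per year: {daily_summary_nodes:,}")
```

With these inputs the node store comes out just under 11 GB and the relationship store at roughly 115 GB per year, in line with the figures I mentioned.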
I'm also wondering whether it's a good idea to avoid any properties on the pageview nodes, since the fact that a pageview node is linked to some hourly or daily partition node already tells on what date it happened. Storage-wise, however, it may be a better idea to store a numeric value representing the actual time in a property (perhaps a short holding the hour of the month), thus saving some relations and nodes.

Best regards,
Arjen

