Hi List,

I'm currently investigating how we can improve the storage of our 
website's pageviews. The model described in the wiki seems to be a 
bit too simple for our case, but is basically how I'd like to do it.

We serve about 1 billion pageviews yearly, so that's already 3 billion 
primitives for the pageviews alone. The visits themselves (paths in the 
example) will be another 100 million nodes or so plus various relations 
and/or properties.

To efficiently retrieve aggregated summaries from that dataset (e.g. 
which article got the most views on date D) I probably need to introduce 
summaries of data (article X got N pageviews on 2010/7/24) and/or 
partition the pageviews per date-period. And those summaries probably 
need to be linked to some "date-tree" (i.e. year -> month -> day) for 
efficient querying when no specific articles are requested.
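To make the date-tree idea a bit more concrete, here is a minimal 
in-memory sketch. It uses plain Python dictionaries rather than the 
Neo4j API, and the names new_tree, record_view and top_article are 
purely hypothetical. It only illustrates the year -> month -> day -> 
per-article summary layout, not how it would be stored in the graph.

```python
from collections import defaultdict
from datetime import date


def new_tree():
    # year -> month -> day -> {article: pageview count}
    # (hypothetical stand-in for date-tree nodes and summary nodes)
    return defaultdict(
        lambda: defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    )


def record_view(tree, day, article):
    # Walk year -> month -> day and bump that article's daily summary.
    tree[day.year][day.month][day.day][article] += 1


def top_article(tree, day):
    # Only the summaries hanging off one day need to be inspected.
    summaries = tree[day.year][day.month][day.day]
    if not summaries:
        return None
    return max(summaries.items(), key=lambda kv: kv[1])


tree = new_tree()
record_view(tree, date(2010, 7, 24), "X")
record_view(tree, date(2010, 7, 24), "X")
record_view(tree, date(2010, 7, 24), "Y")
print(top_article(tree, date(2010, 7, 24)))  # ('X', 2)
```

The point of the layout is that a "most viewed on date D" query touches 
only the summaries under one day, never the individual pageviews.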

Unfortunately, on a normal day, we get pageviews to about 70k different 
articles/items, so such per-article daily partition-nodes would add 
another 25 million nodes with 2 relations each. And hourly summary nodes 
would of course be even more.
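The arithmetic behind those figures, as a rough sketch. The hourly 
number is only an upper bound and rests on the (unrealistic) assumption 
that every one of the 70k articles is viewed in every hour of the day.

```python
ARTICLES_PER_DAY = 70_000   # distinct articles/items viewed on a normal day
DAYS_PER_YEAR = 365
RELATIONS_PER_SUMMARY = 2   # 2 relations per summary node, as stated above

daily_summary_nodes = ARTICLES_PER_DAY * DAYS_PER_YEAR
daily_summary_relations = daily_summary_nodes * RELATIONS_PER_SUMMARY

# Upper bound only: assumes each article is seen in all 24 hours.
hourly_summary_nodes_max = ARTICLES_PER_DAY * 24 * DAYS_PER_YEAR

print(daily_summary_nodes)       # 25550000
print(daily_summary_relations)   # 51100000
print(hourly_summary_nodes_max)  # 613200000
```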

All in all, there'll probably be some 1.2 billion nodes and 3.5 billion 
relations for a single year (and mostly properties on the less frequent 
items).

As there are no hard performance requirements (a few seconds or even 
minutes to answer a query is acceptable), Neo4j seems up to the task of 
storing a few years' worth of data on a single machine, although the 
server(s) storing that would probably still require quite a bit of 
memory and preferably SSD storage.

I'm wondering how much on-disk storage such a set-up would roughly 
require. Is the size per node (9 bytes) and relationship (33 bytes) from 
the configuration-page in the wiki valid for this much data as well? In 
that case it would be about 10GB for the nodes and 110GB for the 
relationships?
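For reference, a back-of-the-envelope version of that calculation. It 
assumes the per-record sizes from the wiki (9 and 33 bytes) hold at this 
scale, which is exactly what is being asked, and it ignores property 
storage and indexes entirely. The helper name store_size_gb is made up.

```python
NODES = 1_200_000_000          # estimated nodes for one year
RELATIONSHIPS = 3_500_000_000  # estimated relationships for one year
NODE_BYTES = 9                 # per-node size from the wiki (assumed to hold)
RELATIONSHIP_BYTES = 33        # per-relationship size from the wiki (assumed)


def store_size_gb(records, bytes_per_record):
    # Decimal gigabytes (10**9 bytes); properties and indexes not included.
    return records * bytes_per_record / 1e9


node_store_gb = store_size_gb(NODES, NODE_BYTES)
relationship_store_gb = store_size_gb(RELATIONSHIPS, RELATIONSHIP_BYTES)

print(round(node_store_gb, 1))          # 10.8
print(round(relationship_store_gb, 1))  # 115.5
```

So roughly 126 GB per year for nodes and relationships together, in the 
same ballpark as the rounded figures above.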

How much storage would be required for properties, for instance for a 
short, int or long? I'm wondering whether it's a good idea to avoid any 
properties on the pageview-nodes, since the fact that a pageview-node is 
linked to some hourly or daily partition-node already tells on what date 
it happened. Storage-wise, however, it may be a better idea to store a 
numeric value in a property representing the actual time (perhaps a 
short representing the hour of the month), thus saving some relations 
and nodes.
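A small sketch of what such an hour-of-the-month value could look like. 
The function names are hypothetical, and it assumes the year and month 
are known from the partition-node the pageview is linked to.

```python
from datetime import datetime


def hour_of_month(ts):
    # 0 for the first hour of day 1, up to 30 * 24 + 23 = 743 in a
    # 31-day month, so the value fits easily in a 16-bit short.
    return (ts.day - 1) * 24 + ts.hour


def decode_hour_of_month(year, month, value):
    # Inverse of hour_of_month; year and month come from the partition.
    day_index, hour = divmod(value, 24)
    return datetime(year, month, day_index + 1, hour)


encoded = hour_of_month(datetime(2010, 7, 24, 15, 30))
print(encoded)                                   # 567
print(decode_hour_of_month(2010, 7, encoded))    # 2010-07-24 15:00:00
```

Minutes and seconds are lost in this encoding, so it only works if 
hourly resolution is enough.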

Best regards,

Arjen
_______________________________________________
Neo4j mailing list
[email protected]
https://lists.neo4j.org/mailman/listinfo/user
