----- Original Message -----
From: Doug Meil <[email protected]>
Sent: Thu Jul 14 2011 22:29:16 GMT+0200 (CET)
To:
CC:
Subject: Re: data structure

Hi there-

A few high-level suggestions...

re: "to generate a report: for example we want to know how many impressions were done by all users in last x days"

Can you create a summary table by day (via an MR job), and then have your ad-hoc report hit the summary table?

re: "and with the data growing, the time will increase"

Yes. As you add more and more data, processing times will increase. That's why you need to expect to periodically expand your cluster.

I guess a summary table will be it.

The only disadvantage of such tables is that they are not that flexible. For example, if I store data for every hour (24 entries a day), I can run fast reports for hour-aligned time ranges, e.g. 12:00 to 15:00, but there is no way to generate a report for the time range 12:30 to 13:45. I guess we will live with that constraint.

I thought Hadoop + HBase + MapReduce was such cool magic stuff that there is no need for summary tables, and data scans run within milliseconds... ;-)
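
To make the trade-off concrete, here is a minimal in-memory sketch of the roll-up idea in plain Python. It is not HBase or MapReduce code: the function names, the `SAMPLE` data, and the dict-based "table" are all hypothetical stand-ins, and it assumes the bucket size divides a day evenly. It only illustrates why a range is answerable from the summary when both ends fall on bucket boundaries.

```python
from collections import Counter
from datetime import datetime, timedelta

BUCKET = timedelta(hours=1)  # assumed summary granularity: one row per hour


def bucket_start(ts, bucket=BUCKET):
    """Round a timestamp down to the start of its bucket (within its day)."""
    day_start = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    return day_start + ((ts - day_start) // bucket) * bucket


def build_summary(impressions, bucket=BUCKET):
    """Roll raw impression timestamps up into per-bucket counts.

    Stands in for the periodic job that would fill the summary table.
    """
    return Counter(bucket_start(ts, bucket) for ts in impressions)


def count_range(summary, start, end, bucket=BUCKET):
    """Count impressions in [start, end) using only the summary rows."""
    if bucket_start(start, bucket) != start or bucket_start(end, bucket) != end:
        raise ValueError("range is not aligned to bucket boundaries")
    total = 0
    t = start
    while t < end:
        total += summary.get(t, 0)
        t += bucket
    return total


# Hypothetical raw impressions, for illustration only.
SAMPLE = [
    datetime(2011, 7, 14, 12, 5),
    datetime(2011, 7, 14, 12, 40),
    datetime(2011, 7, 14, 13, 50),
    datetime(2011, 7, 14, 14, 59),
    datetime(2011, 7, 14, 15, 0),
]

hourly = build_summary(SAMPLE)
# 12:00 to 15:00 is hour-aligned, so the summary alone can answer it.
aligned_total = count_range(hourly, datetime(2011, 7, 14, 12), datetime(2011, 7, 14, 15))
```

With hourly buckets, a request for 12:30 to 13:45 raises `ValueError`, which is the constraint described above. Passing `bucket=timedelta(minutes=15)` to both functions would make that range answerable, at the cost of four times as many summary rows (96 per day instead of 24).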
