Hi there, I've set up a virtual machine hosting Hive. My use case is web traffic analytics, so most of my queries are:
- How many requests today?
- How many requests today, grouped by country?
- What are the most requested URLs?
- What is the average HTTP server response time, in 5-minute slots?

In other words, consider:

```sql
CREATE TABLE logs (
  url STRING,
  orig_country STRING,
  http_rt INT
);

SELECT COUNT(*) FROM logs;
SELECT COUNT(*), orig_country FROM logs GROUP BY orig_country;
SELECT COUNT(*), url FROM logs GROUP BY url;
SELECT AVG(http_rt) FROM logs ...
```

Two questions here:

1. How do I generate 5-minute slots for my averages? (In PostgreSQL, I used generate_series() and a JOIN.) I'd like to avoid running multiple queries, each with a `WHERE date > ... AND date < ...`. Maybe a mapper that maps the date string to a slot number?

2. What is the best storage format for this table? Since its purpose is analytical, I thought a columnar format was the way to go, so I tried RCFILE, but for around 1 million rows (quite small, I know) the results were the opposite of what I expected:

Storage / query duration / disk table size
TEXTFILE / 22 seconds / 250 MB
RCFILE / 31 seconds / 320 MB

I thought storing values in columns would speed up the aggregation. Maybe the dataset is too small to tell, or I missed something? Would adding Snappy compression help (I'm not sure whether RCFiles are compressed by default)?

Thank you!
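For question 1, here is a sketch of what I have in mind, assuming the table has a `log_time` string column in `'yyyy-MM-dd HH:mm:ss'` format (a column name I made up for illustration): convert it to epoch seconds with `unix_timestamp()`, then integer-divide by 300 so every row falls into a 5-minute bucket.

```sql
-- Sketch only: `log_time` is a hypothetical timestamp-string column.
-- unix_timestamp() turns it into epoch seconds; dividing by 300 and
-- flooring gives a 5-minute slot number we can GROUP BY directly.
SELECT
  floor(unix_timestamp(log_time) / 300) AS slot,
  AVG(http_rt) AS avg_rt
FROM logs
GROUP BY floor(unix_timestamp(log_time) / 300);
```

Would this be idiomatic in Hive, or is there a cleaner equivalent of generate_series()? (One difference I can see: empty slots simply won't appear in the output, whereas the generate_series() + JOIN approach produced a row for every slot.)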
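For question 2, my understanding (please correct me if wrong) is that RCFILE is not compressed unless compression is enabled at write time, so I would try something like the following before re-populating the table. The property names are from the older mapred-era configuration I'm running; I'm not sure they are right for newer versions.

```sql
-- Assumption: compression must be switched on explicitly before the
-- INSERT that writes the RCFILE data; these are the settings I'd try.
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

CREATE TABLE logs_rc (
  url STRING,
  orig_country STRING,
  http_rt INT
)
STORED AS RCFILE;

-- Rewrite the existing data so it is stored compressed.
INSERT OVERWRITE TABLE logs_rc SELECT * FROM logs;
```

Is that the right way to get Snappy-compressed RCFiles, and would you expect it to help at this data size?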