Re: Best table storage for analytical use case

2013-03-06 Thread Sékine Coulibaly
Hi Dean, Indeed, switching from RCFiles to SequenceFiles yielded a query duration down 35% (82 secs down to 53 secs)! I also added Snappy/Gzip block compression. Things are getting better, down to 30 secs (SequenceFile + Snappy). Yes, most requests have a WHERE clause with a time range, will have
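For reference, a minimal sketch of the kind of setup described here, assuming a hypothetical source table named weblogs and a SequenceFile copy named weblogs_seq (the table and column names are illustrative, not quoted from the thread; the SET properties are the standard Hive/Hadoop ones for Snappy block compression):

    -- Compress the output of the INSERT below as block-compressed SequenceFiles
    SET hive.exec.compress.output=true;
    SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
    SET mapred.output.compression.type=BLOCK;
    SET io.seqfile.compression.type=BLOCK;

    -- Hypothetical target table stored as SequenceFile
    CREATE TABLE weblogs_seq (
      ts      STRING,
      country STRING,
      url     STRING,
      resp_ms INT
    )
    STORED AS SEQUENCEFILE;

    -- Rewrite the data from the original table into the new format
    INSERT OVERWRITE TABLE weblogs_seq
    SELECT ts, country, url, resp_ms FROM weblogs;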

Re: Best table storage for analytical use case

2013-03-06 Thread Dean Wampler
MapReduce is very coarse-grained. It might seem that more cores is better, but once the data sizes get well below the block threshold in size, the overhead of starting JVM processes and all the other background work becomes a significant percentage of the overall runtime. So, you quickly reach the

Best table storage for analytical use case

2013-03-04 Thread Sékine Coulibaly
Hi there, I've set up a virtual machine hosting Hive. My use case is web traffic analytics, hence most queries are: - how many requests today? - how many requests today, grouped by country? - most requested URLs? - average HTTP server response time (5-minute slots)? In other words,
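For concreteness, a few HiveQL sketches of the queries listed above, assuming a hypothetical weblogs table with columns ts, country, url, resp_ms and a daily partition column dt (none of these names appear in the thread):

    -- Requests today, grouped by country
    SELECT country, COUNT(*) AS requests
    FROM weblogs
    WHERE dt = '2013-03-04'
    GROUP BY country;

    -- Most requested URLs today
    SELECT url, COUNT(*) AS hits
    FROM weblogs
    WHERE dt = '2013-03-04'
    GROUP BY url
    ORDER BY hits DESC
    LIMIT 10;

    -- Average HTTP response time in 5-minute slots (slot start as epoch seconds)
    SELECT floor(unix_timestamp(ts) / 300) * 300 AS slot_epoch,
           AVG(resp_ms) AS avg_resp_ms
    FROM weblogs
    WHERE dt = '2013-03-04'
    GROUP BY floor(unix_timestamp(ts) / 300) * 300;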

Re: Best table storage for analytical use case

2013-03-04 Thread Dean Wampler
RCFile won't help much (and apparently not at all in this case ;) unless you have a lot of columns and you always query just a few of them. However, you should get better results with Sequence Files (binary format) and usually with a compression scheme like BZip that supports block-level (as opposed
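Dean's block-level vs. record-level distinction corresponds to the SequenceFile compression type in Hadoop. A hedged sketch of the relevant session settings (the BZip2 codec here is only illustrative of Dean's suggestion; the follow-up in this thread ended up using Snappy):

    -- Write block-compressed rather than record-compressed SequenceFiles
    SET hive.exec.compress.output=true;
    SET io.seqfile.compression.type=BLOCK;
    SET mapred.output.compression.type=BLOCK;
    SET mapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec;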