Hive supports more than one partition column; have you tried that? Maybe you can create two partition columns, named date and user.
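A minimal sketch of what that could look like, assuming a hypothetical unpartitioned staging table `raw_clicks` (columns `click_time`, `ip`, `userid`) that holds one day's upload. The table names, column names, and the date literal are made up for illustration; the partition columns are called `dt` and `userid` here rather than date and user, to steer clear of words that may be reserved in some Hive versions.

```sql
-- Hypothetical schema: adjust names and types to the real log format.
CREATE TABLE clicks (
  click_time STRING,
  ip         STRING
)
PARTITIONED BY (dt STRING, userid STRING);

-- Allow Hive to create the userid partitions on the fly.
SET hive.exec.dynamic.partition=true;

-- Nightly load: dt is a static partition value, userid is a dynamic one
-- taken from the last column of the SELECT.
INSERT OVERWRITE TABLE clicks PARTITION (dt = '2011-09-02', userid)
SELECT click_time, ip, userid
FROM raw_clicks;

-- A per-user query now only needs to read the matching userid partitions.
SELECT dt, click_time, ip
FROM clicks
WHERE userid = '42';
```

One caveat with this layout: every (day, user) pair becomes its own directory, so with many users the number of partitions and small files can grow quickly. It is probably worth trying on a sample first.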
Hive 0.7 also supports indexes; you may want to give that a try.

On Sat, Sep 3, 2011 at 1:18 AM, Mark Grover <mgro...@oanda.com> wrote:
> Hello folks,
> I am fairly new to Hive and am wondering if you could share some of the best
> practices for storing/querying data with Hive.
>
> Here is an example of the problem I am trying to solve.
>
> The traffic to our website is logged in files that contain information about
> clicks from various users.
> Simplified, the log file looks like:
> t_1, ip_1, userid_1
> t_2, ip_2, userid_2
> t_3, ip_3, userid_3
> ...
>
> where t_i represents time of the click, ip_i represents ip address where the
> click originated from, and userid_i represents the user ID of the user.
>
> Since the clicks are logged on an ongoing basis, partitioning our Hive table
> by day seemed like the obvious choice. Every night we upload the data from
> the previous day into a new partition.
>
> However, we would also want the capability to find all log lines
> corresponding to a particular user. With our present partitioning scheme, all
> day partitions are searched for that user ID but this takes a long time. I am
> looking for ideas/suggestions/thoughts/comments on how to reduce this time.
>
> As a solution, I am thinking that perhaps we could have 2 independent tables,
> one which stores data partitioned by day and the other partitioned by userId.
> With the second table partitioned by userId, I will have to find some way of
> maintaining the partitions since Hive doesn't support appending of files.
> Also, this seems suboptimal, since we are doubling the amount of data
> that we store. What do you folks think of this idea?
>
> Do you have any other suggestions on how we can approach this problem?
>
> What have other people in similar situations done? Please share.
>
> Thank you in advance!
> Mark
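
On the index suggestion, here is a rough sketch. It assumes a hypothetical table `clicks_by_day` that is partitioned by day only and keeps `userid` as an ordinary column; the table and index names are invented for this example.

```sql
-- Hypothetical day-partitioned table with userid as a regular column.
CREATE TABLE clicks_by_day (
  click_time STRING,
  ip         STRING,
  userid     STRING
)
PARTITIONED BY (dt STRING);

-- Define a compact index on userid; it stays empty until rebuilt.
CREATE INDEX clicks_userid_idx
ON TABLE clicks_by_day (userid)
AS 'COMPACT'
WITH DEFERRED REBUILD;

-- Build (or refresh) the index, e.g. after each nightly load.
ALTER INDEX clicks_userid_idx ON clicks_by_day REBUILD;
```

Whether a query on `userid` picks up the index automatically depends on the Hive version and its settings, so it is worth checking the query plan with EXPLAIN before relying on it.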