Re: Best practices for storing data on Hive

wd Sun, 04 Sep 2011 01:02:33 -0700

Hive support more than one partitions, have your tried? Maybe you can
create to partitions named as date and user.


Hive 0.7 also support index, maybe you can have a try.

On Sat, Sep 3, 2011 at 1:18 AM, Mark Grover <mgro...@oanda.com> wrote:
> Hello folks,
> I am fairly new to Hive and am wondering if you could share some of the best 
> practices for storing/querying data with Hive.
>
> Here is an example of the problem I am trying to solve.
>
> The traffic to our website is logged in files that contain information about 
> clicks from various users.
> Simplified, the log file looks like:
> t_1, ip_1, userid_1
> t_2, ip_2, userid_2
> t_3, ip_3, userid_3
> ...
>
> where t_i represents time of the click, ip_i represents ip address where the 
> click originated from, and userid_i represents the user ID of the user.
>
> Since the clicks are logged on an ongoing basis, partitioning our Hive table 
> by day seemed like the obvious choice. Every night we upload the data from 
> the previous day into a new partition.
>
> However, we would also want the capability to find all log lines 
> corresponding to a particular user. With our present partitioning scheme, all 
> day partitions are searched for that user ID but this takes a long time. I am 
> looking for ideas/suggestions/thoughts/comments on how to reduce this time.
>
> As a solution, I am thinking that perhaps we could have 2 independent tables, 
> one which stores data partitioned by day and the other partitioned by userId. 
> With the second table partitioned by userId, I will have to find some way of 
> maintaining the partitions since Hive doesn't support appending of files. 
> Also, this seems suboptimal, since we are doubling that the amount of data 
> that we store. What do you folks think of this idea?
>
> Do you have any other suggestions on how we can approach this problem?
>
> What have other people in similar situations done? Please share.
>
> Thank you in advance!
> Mark
>

Re: Best practices for storing data on Hive

Reply via email to