RE: Best practices for storing data on Hive

Aggarwal, Vaibhav Tue, 06 Sep 2011 13:15:38 -0700

>However, given the amount of users that visit our website (hundreds of 
>thousands of unique users every day), this would lead to a large number of 
>partitions (and rather small file sizes, ranging from a >couple of bytes to a 
>couple of KB). From the documentation I've read online, it seems that 
>Hive/Hadoop weren't designed to deal with such small file sizes and such a 
>situation should be avoided if >possible.


Hive uses CombineHiveInputFormat which can be used to combine splits across 
files (if they are not compressed). The new version of Hive (0.7.1) also 
combines splits across partition boundaries.
This might be worth an experiment.

>We had a scenario previously where we were partitioning by day and hour and 
>because of the sheer number of partitions queries like "select * from <table> 
>LIMIT 1;" were taking very long and even >failed because of "Java out of Heap 
>space" errors. My guess is that the master node was munching through all these 
>partitions and couldn't deal with the large number of partitions.

This is a real issue. In case you don't see an explosive growth in number of 
partitions, you can choose to increase the HADOOP_HEAPSIZE on just the master 
node.

Thanks
Vaibhav

On 11-09-04 04:01 AM, wd wrote:
> Hive support more than one partitions, have your tried? Maybe you can 
> create to partitions named as date and user.
>
> Hive 0.7 also support index, maybe you can have a try.
>
> On Sat, Sep 3, 2011 at 1:18 AM, Mark Grover<[email protected]>  wrote:
>> Hello folks,
>> I am fairly new to Hive and am wondering if you could share some of the best 
>> practices for storing/querying data with Hive.
>>
>> Here is an example of the problem I am trying to solve.
>>
>> The traffic to our website is logged in files that contain information about 
>> clicks from various users.
>> Simplified, the log file looks like:
>> t_1, ip_1, userid_1
>> t_2, ip_2, userid_2
>> t_3, ip_3, userid_3
>> ...
>>
>> where t_i represents time of the click, ip_i represents ip address where the 
>> click originated from, and userid_i represents the user ID of the user.
>>
>> Since the clicks are logged on an ongoing basis, partitioning our Hive table 
>> by day seemed like the obvious choice. Every night we upload the data from 
>> the previous day into a new partition.
>>
>> However, we would also want the capability to find all log lines 
>> corresponding to a particular user. With our present partitioning scheme, 
>> all day partitions are searched for that user ID but this takes a long time. 
>> I am looking for ideas/suggestions/thoughts/comments on how to reduce this 
>> time.
>>
>> As a solution, I am thinking that perhaps we could have 2 independent 
>> tables, one which stores data partitioned by day and the other partitioned 
>> by userId. With the second table partitioned by userId, I will have to find 
>> some way of maintaining the partitions since Hive doesn't support appending 
>> of files. Also, this seems suboptimal, since we are doubling that the amount 
>> of data that we store. What do you folks think of this idea?
>>
>> Do you have any other suggestions on how we can approach this problem?
>>
>> What have other people in similar situations done? Please share.
>>
>> Thank you in advance!
>> Mark
>>

--
Mark Grover, Business Intelligence Analyst OANDA Corporation

www: oanda.com www: fxtrade.com
e: [email protected]

"Best Trading Platform" - World Finance's Forex Awards 2009.
"The One to Watch" - Treasury Today's Adam Smith Awards 2009.

RE: Best practices for storing data on Hive

Reply via email to