On Thu, Sep 8, 2011 at 8:30 PM, Steven Wong <sw...@netflix.com> wrote:
> I think this statement is not true: "By distributing by (and preferably
> ordering by) user_id, we can minimize seek time in the table because Hive
> knows where all entries pertaining to a specific user are stored." I think
> it is not true whether the table is bucketed on user_id or not (assuming
> that user_id is not a partition column or indexed column).
>
> -----Original Message-----
> From: Mark Grover [mailto:mgro...@oanda.com]
> Sent: Tuesday, September 06, 2011 2:36 PM
> To: user@hive.apache.org
> Cc: Travis Powell; Baiju Devani; Bob Tiernay
> Subject: Re: Best practices for storing data on Hive
>
> Thanks for your reply, Travis.
>
> I was under the impression that for Hive to make use of the sorted
> structure of the data (i.e. for the table named "data" in your example),
> the metadata of the table (specified during table creation) has to
> advertise that property. However, I don't see any special metadata
> specifying such a property when the "data" table was created.
>
> Is that true? If so, is such metadata specified by using the CLUSTERED BY
> and SORTED BY clauses during table creation?
>
> On 11-09-06 03:50 PM, Travis Powell wrote:
> > Hi Mark,
> >
> > When we load data into Hive, we use a staging table to dynamically
> > partition our data. This might help you too.
> >
> > We create our initial table and our staging table:
> >
> > DROP TABLE IF EXISTS staging_data;
> > CREATE TABLE staging_data ( ... )
> > ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
> >
> > CREATE TABLE data ( ... )
> > PARTITIONED BY (dt STRING, hour INT)
> > ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
> > STORED AS SEQUENCEFILE;
> >
> > INSERT OVERWRITE TABLE data PARTITION (dt, hour)
> > SELECT q.*, to_date(q.session_timestamp) AS dt,
> >        hour(q.session_timestamp) AS hour
> > FROM staging_data q
> > DISTRIBUTE BY user_id SORT BY user_id;
> >
> > So.....
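[The CLUSTERED BY / SORTED BY metadata Mark asks about is declared at table-creation time. A minimal sketch, assuming the thread's example schema; the column list and bucket count are illustrative, not from the thread:]

```sql
-- Declaring bucketing and in-bucket sort order in the table metadata.
-- Without this clause, data may happen to be sorted on disk, but the
-- metastore records nothing about it, so Hive cannot rely on it.
CREATE TABLE data (
  session_timestamp STRING,
  ip                STRING,
  user_id           STRING
)
PARTITIONED BY (dt STRING, hour INT)
CLUSTERED BY (user_id) SORTED BY (user_id ASC) INTO 32 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS SEQUENCEFILE;
```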
> > By distributing by (and preferably ordering by) user_id, we can
> > minimize seek time in the table because Hive knows where all entries
> > pertaining to a specific user are stored. Partitions by time have the
> > best performance, because chances are almost every query will have
> > some time-related component in it (and it spreads out the data among
> > partitions fairly well).
> >
> > Let me know if this works for you. We start every job with those first
> > few lines of Hive script. It works well for us.
> >
> > Thanks,
> >
> > Travis Powell
> >
> > -----Original Message-----
> > From: Mark Grover [mailto:mgro...@oanda.com]
> > Sent: Tuesday, September 06, 2011 12:39 PM
> > To: user@hive.apache.org
> > Cc: wd; Bob Tiernay; Baiju Devani
> > Subject: Re: Best practices for storing data on Hive
> >
> > Thanks for the response, wd.
> >
> > I would REALLY APPRECIATE it if other people could share their views
> > as well.
> >
> > Here are the possible solutions I have thought about for the problem
> > (see the original email for a description of the problem):
> >
> > 1) Multiple partitions: We would partition the table by day and
> > userId. However, given the number of users that visit our website
> > (hundreds of thousands of unique users every day), this would lead to
> > a large number of partitions (and rather small file sizes, ranging
> > from a couple of bytes to a couple of KB). From the documentation I've
> > read online, it seems that Hive/Hadoop weren't designed to deal with
> > such small file sizes, and such a situation should be avoided if
> > possible.
> > We previously had a scenario where we were partitioning by day and
> > hour, and because of the sheer number of partitions, queries like
> > "SELECT * FROM <table> LIMIT 1;" took very long and even failed with
> > "Java out of heap space" errors. My guess is that the master node was
> > munching through all these partitions and couldn't deal with their
> > large number.
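[The failure mode described in option 1 is why Hive caps dynamic partition creation. A sketch of the relevant knobs; property names are from 0.7-era Hive, and the values shown are the commonly cited defaults, so treat them as approximate:]

```sql
-- Safety limits Hive applies to dynamic partition inserts; a
-- per-user partition scheme would blow well past these defaults.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions=1000;        -- total per query
SET hive.exec.max.dynamic.partitions.pernode=100; -- per mapper/reducer
```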
> >
> > 2) Use of data locality: We could keep the data partitioned by day and
> > bucketed by userId, and within each bucket sort the data by (userId,
> > time). This way we could keep the data related to each userId together
> > within a daily partition, and if Hive could be made aware of this sort
> > order and could use it to improve search/query times, that would
> > alleviate the problem quite a bit. The big question here is: does Hive
> > leverage the sort order of data within a partition's buckets when
> > running (most/all?) queries, where possible?
> >
> > 3) Using an index: As wd mentioned, Hive 0.7 introduces the notion of
> > an index. If I index on userId, given that we can see hundreds of
> > thousands of unique users per day, would indexing prove to be a good
> > move? Are there people who are using it for similar purposes or on a
> > similar scale?
> >
> > 4) Using 2 "orthogonal" tables: As mentioned in my original email (see
> > below), we could have 2 independent tables, one that stores data
> > partitioned by day and the other partitioned by userId. For
> > maintaining partitions in the userId-partitioned table, I am planning
> > to do the following: in the nightly job, if userId=X visited the
> > website the previous day, we create a partition for userId=X if it
> > doesn't already exist. Once the partition is created, all clicks for
> > that userId on the day in question are put in a single file and
> > dropped into the userId=X folder on HDFS. This method could be used to
> > simulate an "append" to the Hive table. Each file would only be a few
> > bytes to a few KB, and the format of the table would be sequence file.
> >
> > What are your thoughts about the above 4 methods? Any particular likes
> > or dislikes? Any comments or suggestions would be helpful.
> >
> > Thank you again in advance!
> >
> > Mark
> >
> > On 11-09-04 04:01 AM, wd wrote:
> >> Hive supports more than one partition column, have you tried?
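[On option 2's big question: as far as I know, Hive exploits bucket/sort metadata mainly for sampling and for (sort-merge) bucket map joins on the bucket key, not for arbitrary point lookups, which is consistent with Steven's correction at the top of the thread. A sketch of the settings involved, with property names from Hive 0.6/0.7:]

```sql
-- Make inserts actually honor the declared buckets and sort order:
SET hive.enforce.bucketing=true;
SET hive.enforce.sorting=true;

-- Let the optimizer use bucketing/sorting for joins on the bucket key:
SET hive.optimize.bucketmapjoin=true;
SET hive.optimize.bucketmapjoin.sortedmerge=true;

-- Bucketing also enables efficient sampling; assuming the "data"
-- table was bucketed into 32 buckets on user_id:
SELECT * FROM data TABLESAMPLE (BUCKET 1 OUT OF 32 ON user_id);
```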
> >> Maybe you can create two partition columns, named date and user.
> >>
> >> Hive 0.7 also supports indexes; maybe you can give that a try.
> >>
> >> On Sat, Sep 3, 2011 at 1:18 AM, Mark Grover <mgro...@oanda.com> wrote:
> >>> Hello folks,
> >>> I am fairly new to Hive and am wondering if you could share some of
> >>> the best practices for storing/querying data with Hive.
> >>>
> >>> Here is an example of the problem I am trying to solve.
> >>>
> >>> The traffic to our website is logged in files that contain
> >>> information about clicks from various users.
> >>> Simplified, the log file looks like:
> >>> t_1, ip_1, userid_1
> >>> t_2, ip_2, userid_2
> >>> t_3, ip_3, userid_3
> >>> ...
> >>>
> >>> where t_i represents the time of the click, ip_i represents the IP
> >>> address the click originated from, and userid_i represents the user
> >>> ID of the user.
> >>>
> >>> Since the clicks are logged on an ongoing basis, partitioning our
> >>> Hive table by day seemed like the obvious choice. Every night we
> >>> upload the data from the previous day into a new partition.
> >>>
> >>> However, we would also want the capability to find all log lines
> >>> corresponding to a particular user. With our present partitioning
> >>> scheme, all day partitions are searched for that user ID, but this
> >>> takes a long time. I am looking for ideas/suggestions/thoughts/
> >>> comments on how to reduce this time.
> >>>
> >>> As a solution, I am thinking that perhaps we could have 2
> >>> independent tables, one that stores data partitioned by day and the
> >>> other partitioned by userId. With the second table partitioned by
> >>> userId, I will have to find some way of maintaining the partitions,
> >>> since Hive doesn't support appending to files. Also, this seems
> >>> suboptimal, since we would be doubling the amount of data that we
> >>> store. What do you folks think of this idea?
> >>>
> >>> Do you have any other suggestions on how we can approach this
> >>> problem?
> >>>
> >>> What have other people in similar situations done? Please share.
> >>>
> >>> Thank you in advance!
> >>> Mark
> >>>
>
> --
> Mark Grover, Business Intelligence Analyst
> OANDA Corporation
>
> www: oanda.com    www: fxtrade.com
> e: mgro...@oanda.com
>
> "Best Trading Platform" - World Finance's Forex Awards 2009.
> "The One to Watch" - Treasury Today's Adam Smith Awards 2009.

Hive does not optimize when the table's data is merely ordered by a column; sort order that is not recorded in the table metadata is invisible to the planner. The only ways to optimize such lookups are partitions, bucketing, or indexes.
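[For the index route mentioned in the last line, Hive 0.7's compact index looks like the sketch below; the index name is made up, and the table/column come from the thread's example. Note that in 0.7 the optimizer does not consult the index automatically; it materializes an index table you can query yourself, and it must be rebuilt after new partitions are loaded:]

```sql
-- Build a compact index on user_id (Hive 0.7+).
CREATE INDEX data_user_idx ON TABLE data (user_id)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;

-- Rebuild (e.g. in the nightly job, after loading new partitions):
ALTER INDEX data_user_idx ON data REBUILD;
```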