On Thu, Sep 8, 2011 at 8:30 PM, Steven Wong <sw...@netflix.com> wrote:
> I think this statement is not true: "By distributing by (and preferably
> ordering by) user_id, we can minimize seek time in the table because Hive
> knows where all entries pertaining to a specific user are stored." I think
> it is not true whether the table is bucketed on user_id or not (assuming
> that user_id is not a partition column or indexed column).
>
> -----Original Message-----
> From: Mark Grover [mailto:mgro...@oanda.com]
> Sent: Tuesday, September 06, 2011 2:36 PM
> To: user@hive.apache.org
> Cc: Travis Powell; Baiju Devani; Bob Tiernay
> Subject: Re: Best practices for storing data on Hive
>
> Thanks for your reply, Travis.
>
> I was under the impression that for Hive to make use of the sorted
> structure of the data (i.e. for the table named "data" in your example),
> the metadata of the table (specified during table creation) has to
> advertise that property. However, I don't see any special metadata
> specifying such a property when the "data" table was created.
>
> Is that true? If so, is such metadata specified by using the CLUSTERED BY
> and SORTED BY clauses during table creation?
>
> On 11-09-06 03:50 PM, Travis Powell wrote:
> > Hi Mark,
> >
> > When we load data into Hive, we use a staging table to dynamically
> > partition our data. This might help you too.
> >
> > We create our initial table and our staging table:
> >
> > DROP TABLE IF EXISTS staging_data;
> > CREATE TABLE staging_data ( ... )
> > ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
> >
> > CREATE TABLE data ( ... )
> > PARTITIONED BY (dt STRING, hour INT)
> > ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
> > STORED AS SEQUENCEFILE;
> >
> > INSERT OVERWRITE TABLE data PARTITION (dt, hour)
> > SELECT q.*, to_date(q.session_timestamp) AS dt,
> >        hour(q.session_timestamp) AS hour
> > FROM staging_data q
> > DISTRIBUTE BY user_id SORT BY user_id;
> >
> > So.....
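[The CLUSTERED BY / SORTED BY metadata Mark asks about is declared at table-creation time. A minimal sketch, assuming the thread's example schema; the column list and bucket count are illustrative, not from the thread:]

```sql
-- Declaring bucketing and in-bucket sort order in the table metadata.
-- Without this clause, data may happen to be sorted on disk, but the
-- metastore records nothing about it, so Hive cannot rely on it.
CREATE TABLE data (
  session_timestamp STRING,
  ip                STRING,
  user_id           STRING
)
PARTITIONED BY (dt STRING, hour INT)
CLUSTERED BY (user_id) SORTED BY (user_id ASC) INTO 32 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS SEQUENCEFILE;
```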
> > By distributing by (and preferably ordering by) user_id, we can
> > minimize seek time in the table because Hive knows where all entries
> > pertaining to a specific user are stored. Partitions by time have the
> > best performance, because chances are almost every query will have
> > some time-related component in it (and it spreads out the data among
> > partitions fairly well).
> >
> > Let me know if this works for you. We start every job with those first
> > few lines of Hive script. It works well for us.
> >
> > Thanks,
> >
> > Travis Powell
> >
> > -----Original Message-----
> > From: Mark Grover [mailto:mgro...@oanda.com]
> > Sent: Tuesday, September 06, 2011 12:39 PM
> > To: user@hive.apache.org
> > Cc: wd; Bob Tiernay; Baiju Devani
> > Subject: Re: Best practices for storing data on Hive
> >
> > Thanks for the response, wd.
> >
> > I would REALLY APPRECIATE it if other people could share their views
> > as well.
> >
> > Here are the possible solutions I have thought about for the problem
> > (see the original email for a description of the problem):
> >
> > 1) Multiple partitions: We would partition the table by day and
> > userId. However, given the number of users that visit our website
> > (hundreds of thousands of unique users every day), this would lead to
> > a large number of partitions (and rather small file sizes, ranging
> > from a couple of bytes to a couple of KB). From the documentation I've
> > read online, it seems that Hive/Hadoop weren't designed to deal with
> > such small file sizes, and such a situation should be avoided if
> > possible.
> > We previously had a scenario where we were partitioning by day and
> > hour, and because of the sheer number of partitions, queries like
> > "SELECT * FROM <table> LIMIT 1;" took very long and even failed with
> > "Java out of heap space" errors. My guess is that the master node was
> > munching through all these partitions and couldn't deal with their
> > large number.
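[The failure mode described in option 1 is why Hive caps dynamic partition creation. A sketch of the relevant knobs; property names are from 0.7-era Hive, and the values shown are the commonly cited defaults, so treat them as approximate:]

```sql
-- Safety limits Hive applies to dynamic partition inserts; a
-- per-user partition scheme would blow well past these defaults.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions=1000;        -- total per query
SET hive.exec.max.dynamic.partitions.pernode=100; -- per mapper/reducer
```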
> >
> > 2) Use of data locality: We could keep the data partitioned by day and
> > bucketed by userId, and within each bucket sort the data by (userId,
> > time). This way we could keep the data related to each userId together
> > within a daily partition, and if Hive could be made aware of this sort
> > order and could use it to improve search/query times, that would
> > alleviate the problem quite a bit. The big question here is: does Hive
> > leverage the sort order of data within a partition's buckets when
> > running (most/all?) queries, where possible?
> >
> > 3) Using an index: As wd mentioned, Hive 0.7 introduces the notion of
> > an index. If I index on userId, given that we can see hundreds of
> > thousands of unique users per day, would indexing prove to be a good
> > move? Are there people who are using it for similar purposes or on a
> > similar scale?
> >
> > 4) Using 2 "orthogonal" tables: As mentioned in my original email (see
> > below), we could have 2 independent tables, one that stores data
> > partitioned by day and the other partitioned by userId. For
> > maintaining partitions in the userId-partitioned table, I am planning
> > to do the following: in the nightly job, if userId=X visited the
> > website the previous day, we create a partition for userId=X if it
> > doesn't already exist. Once the partition is created, all clicks for
> > that userId on the day in question are put in a single file and
> > dropped into the userId=X folder on HDFS. This method could be used to
> > simulate an "append" to the Hive table. Each file would only be a few
> > bytes to a few KB, and the format of the table would be sequence file.
> >
> > What are your thoughts about the above 4 methods? Any particular likes
> > or dislikes? Any comments or suggestions would be helpful.
> >
> > Thank you again in advance!
> >
> > Mark
> >
> > On 11-09-04 04:01 AM, wd wrote:
> >> Hive supports more than one partition column, have you tried?
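[On option 2's big question: as far as I know, Hive exploits bucket/sort metadata mainly for sampling and for (sort-merge) bucket map joins on the bucket key, not for arbitrary point lookups, which is consistent with Steven's correction at the top of the thread. A sketch of the settings involved, with property names from Hive 0.6/0.7:]

```sql
-- Make inserts actually honor the declared buckets and sort order:
SET hive.enforce.bucketing=true;
SET hive.enforce.sorting=true;

-- Let the optimizer use bucketing/sorting for joins on the bucket key:
SET hive.optimize.bucketmapjoin=true;
SET hive.optimize.bucketmapjoin.sortedmerge=true;

-- Bucketing also enables efficient sampling; assuming the "data"
-- table was bucketed into 32 buckets on user_id:
SELECT * FROM data TABLESAMPLE (BUCKET 1 OUT OF 32 ON user_id);
```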
> >> Maybe you can create two partition columns, named date and user.
> >>
> >> Hive 0.7 also supports indexes; maybe you can give that a try.
> >>
> >> On Sat, Sep 3, 2011 at 1:18 AM, Mark Grover <mgro...@oanda.com> wrote:
> >>> Hello folks,
> >>> I am fairly new to Hive and am wondering if you could share some of
> >>> the best practices for storing/querying data with Hive.
> >>>
> >>> Here is an example of the problem I am trying to solve.
> >>>
> >>> The traffic to our website is logged in files that contain
> >>> information about clicks from various users.
> >>> Simplified, the log file looks like:
> >>> t_1, ip_1, userid_1
> >>> t_2, ip_2, userid_2
> >>> t_3, ip_3, userid_3
> >>> ...
> >>>
> >>> where t_i represents the time of the click, ip_i represents the IP
> >>> address the click originated from, and userid_i represents the user
> >>> ID of the user.
> >>>
> >>> Since the clicks are logged on an ongoing basis, partitioning our
> >>> Hive table by day seemed like the obvious choice. Every night we
> >>> upload the data from the previous day into a new partition.
> >>>
> >>> However, we would also want the capability to find all log lines
> >>> corresponding to a particular user. With our present partitioning
> >>> scheme, all day partitions are searched for that user ID, but this
> >>> takes a long time. I am looking for ideas/suggestions/thoughts/
> >>> comments on how to reduce this time.
> >>>
> >>> As a solution, I am thinking that perhaps we could have 2
> >>> independent tables, one that stores data partitioned by day and the
> >>> other partitioned by userId. With the second table partitioned by
> >>> userId, I will have to find some way of maintaining the partitions,
> >>> since Hive doesn't support appending to files. Also, this seems
> >>> suboptimal, since we would be doubling the amount of data that we
> >>> store. What do you folks think of this idea?
> >>>
> >>> Do you have any other suggestions on how we can approach this
> >>> problem?
> >>>
> >>> What have other people in similar situations done? Please share.
> >>>
> >>> Thank you in advance!
> >>> Mark
> >>>
>
> --
> Mark Grover, Business Intelligence Analyst
> OANDA Corporation
>
> www: oanda.com    www: fxtrade.com
> e: mgro...@oanda.com
>
> "Best Trading Platform" - World Finance's Forex Awards 2009.
> "The One to Watch" - Treasury Today's Adam Smith Awards 2009.

Hive does not optimize when the table's data is merely ordered by a column; sort order that is not recorded in the table metadata is invisible to the planner. The only ways to optimize such lookups are partitions, bucketing, or indexes.
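[For the index route mentioned in the last line, Hive 0.7's compact index looks like the sketch below; the index name is made up, and the table/column come from the thread's example. Note that in 0.7 the optimizer does not consult the index automatically; it materializes an index table you can query yourself, and it must be rebuilt after new partitions are loaded:]

```sql
-- Build a compact index on user_id (Hive 0.7+).
CREATE INDEX data_user_idx ON TABLE data (user_id)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;

-- Rebuild (e.g. in the nightly job, after loading new partitions):
ALTER INDEX data_user_idx ON data REBUILD;
```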