Re: Indexes in Hive

Jörn Franke Tue, 05 Jan 2016 23:50:19 -0800

If I understand you correctly this could be just another Hive storage format.


> On 06 Jan 2016, at 07:24, Mich Talebzadeh <m...@peridale.co.uk> wrote:
> 
> Hi,
> 
> Thinking loudly.
> 
> Ideally we should consider a totally columnar storage offering in which each
> column of table is stored as compressed value (I disregard for now how
> actually ORC does this but obviously it is not exactly a columnar storage).
> 
> So each table can be considered as a loose federation of columnar storage
> and each column is effectively an index?
> 
> As columns are far narrower than tables, each index block will be very
> higher density and all operations like aggregates can be done directly on
> index rather than table. 
> 
> This type of table offering will be in true nature of data warehouse
> storage. Of course row operations (get me all rows for this table) will be
> slower but that is the trade-off that we need to consider.
> 
> Expecting users to write their own IndexHandler may be technically
> interesting but commercially not viable as Hive needs to be a product on its
> own merit not a development base. Writing your own storage attributes etc.
> requires skills that will put off people seeing Hive as an attractive
> proposition (requiring considerable investment in skill sets in order to
> maintain Hive).
> 
> Thus my thinking on this is to offer true columnar storage in Hive to be a
> proper data warehouse. In addition, the development tools cab ne made
> available for those interested in tailoring their own specific Hive
> solutions.
> 
> 
> HTH
> 
> 
> 
> Dr Mich Talebzadeh
> 
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUr
> V8Pw
> 
> Sybase ASE 15 Gold Medal Award 2008
> A Winning Strategy: Running the most Critical Financial Data on ASE 15
> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.
> pdf
> Author of the books "A Practitioner's Guide to Upgrading to Sybase ASE 15",
> ISBN 978-0-9563693-0-7. 
> co-author "Sybase Transact SQL Guidelines Best Practices", ISBN
> 978-0-9759693-0-4
> Publications due shortly:
> Complex Event Processing in Heterogeneous Environments, ISBN:
> 978-0-9563693-3-8
> Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume
> one out shortly
> 
> http://talebzadehmich.wordpress.com
> 
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only, if you are not the intended
> recipient, you should destroy it immediately. Any information in this
> message shall not be understood as given or endorsed by Peridale Technology
> Ltd, its subsidiaries or their employees, unless expressly so stated. It is
> the responsibility of the recipient to ensure that this email is virus free,
> therefore neither Peridale Ltd, its subsidiaries nor their employees accept
> any responsibility.
> 
> 
> -----Original Message-----
> From: Gopal Vijayaraghavan [mailto:go...@hortonworks.com] On Behalf Of Gopal
> Vijayaraghavan
> Sent: 05 January 2016 23:55
> To: user@hive.apache.org
> Subject: Re: Is Hive Index officially not recommended?
> 
> 
>> So in a nutshell in Hive if "external" indexes are not used for 
>> improving query response, what value they add and can we forget them for
> now?
> 
> The builtin indexes - those that write data as smaller tables are only
> useful in a pre-columnar world, where the indexes offer a huge reduction in
> IO.
> 
> Part #1 of using hive indexes effectively is to write your own
> HiveIndexHandler, with usesIndexTable=false;
> 
> And then write a IndexPredicateAnalyzer, which lets you map arbitrary
> lookups into other range conditions.
> 
> Not coincidentally - we're adding a "ANALYZE TABLE ... CACHE METADATA"
> which consolidates the "internal" index into an external store (HBase).
> 
> Some of the index data now lives in the HBase metastore, so that the
> inclusion/exclusion of whole partitions can be done off the consolidated
> index. 
> 
> https://issues.apache.org/jira/browse/HIVE-11676
> 
> 
> The experience from BI workloads run by customers is that in general, the
> lookup to the right "slice" of data is more of a problem than the actual
> aggregate.
> 
> And that for a workhorse data warehouse, this has to survive even if there's
> a non-stop stream of updates into it.
> 
> Cheers,
> Gopal
>

Re: Indexes in Hive

Reply via email to