Can't resist teasing Mich about this:  "Indeed one often demoralises data
taking advantages of massive parallel processing in Hive."

Surely he meant denormalizes <https://en.wikipedia.org/wiki/Denormalization>.
Nobody would want to demoralise their data -- performance would suffer.  ;)

-- Lefty


On Mon, Feb 1, 2016 at 10:00 AM, Mich Talebzadeh <m...@peridale.co.uk>
wrote:

> Thanks Alan for this explanation. Interesting to see Primary Key in Hive.
>
>
>
>
>
> Comparisons are sometimes made between the ORC storage index in Hive and
> the Oracle Exadata storage index, which goes by the same name!
>
> It is a bit of a misnomer to call the Oracle Exadata structure a “storage
> index”, since it appears that Exadata stores data blocks from tables in the
> storage index, usually when they are accessed via a full-table scan.  In
> this context the Exadata storage index is not a “real” index: it exists
> only in RAM and must be re-created from scratch when the Exadata server is
> bounced.
>
>
>
> Oracle Exadata and SAP HANA, as far as I know, push serial scans into
> hardware.  With HANA, it is by pushing the bitmaps into the L2 cache on the
> chip.  Oracle has special processors on SPARC T5 called D???? <something>
> that offload the column bit scan from the CPU onto separate specialized
> hardware.  As a result, both rely on massive parallelization.
>
> The ORC storage index is neat and different from both Exadata and SAP HANA.
> The way I see ORC storage indexes:
>
>
>
> · They combine index and statistics.
>
> · Each index entry holds min, max, count, and sum statistics for each
> column in a row group of 10,000 rows.
>
> · Crucially, it records the location of the start of each row group, so
> that the query can jump straight to the beginning of the row group.
>
> · The query can push down a SARG (search argument) that limits which rows
> are required, so it can avoid reading an entire file, or at least sections
> of the file, which is by and large what a conventional RDBMS B-tree index
> does.
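
The bullet points above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not ORC's actual on-disk layout or reader code: the names RowGroupEntry, build_index, groups_for_range and scan are hypothetical, a single integer column is assumed, and the "SARG" is reduced to a simple range predicate.

```python
from dataclasses import dataclass
from typing import List

ROWS_PER_GROUP = 10_000  # row-group size quoted in the thread


@dataclass
class RowGroupEntry:
    """Hypothetical index entry: statistics plus the row group's start position."""
    start_row: int
    minimum: int
    maximum: int
    count: int
    total: int


def build_index(column: List[int],
                rows_per_group: int = ROWS_PER_GROUP) -> List[RowGroupEntry]:
    """Compute min, max, count and sum for every row group of one column."""
    entries = []
    for start in range(0, len(column), rows_per_group):
        group = column[start:start + rows_per_group]
        entries.append(
            RowGroupEntry(start, min(group), max(group), len(group), sum(group))
        )
    return entries


def groups_for_range(index: List[RowGroupEntry],
                     low: int, high: int) -> List[RowGroupEntry]:
    """Keep only row groups whose [min, max] range can overlap [low, high]."""
    return [e for e in index if e.maximum >= low and e.minimum <= high]


def scan(column: List[int], index: List[RowGroupEntry],
         low: int, high: int) -> List[int]:
    """Jump to each surviving row group and filter only the rows inside it."""
    hits = []
    for entry in groups_for_range(index, low, high):
        for value in column[entry.start_row:entry.start_row + entry.count]:
            if low <= value <= high:
                hits.append(value)
    return hits
```

With a sorted column of 50,000 values, a narrow range predicate touches only one of the five row groups. On unsorted data the min/max ranges overlap more, so fewer groups can be skipped.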
>
>
>
>
>
> Cheers,
>
>
>
> Dr Mich Talebzadeh
>
>
>
>
>
>
> *From:* Alan Gates [mailto:alanfga...@gmail.com]
> *Sent:* 01 February 2016 17:07
> *To:* user@hive.apache.org
> *Subject:* Re: ORC format
>
>
>
> ORC does not currently expose a primary key to the user, though we have
> talked of having it do that.  As Mich says, the indexing in ORC is oriented
> towards statistics that help the optimizer plan the query.  This can be
> very important in split generation (determining which parts of the input
> will be read by which tasks) as well as in on-the-fly input pruning
> (deciding not to read a section of the file because the stats show that no
> rows in that section will match a predicate).  Either of these can help
> joins.  But as there is no user-visible primary key, there is no ability to
> rewrite the join as an index-based join, which I think is what you were
> asking about in your original email.
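
The split generation step described above can be sketched roughly as follows. This is a toy model and not Hive's split-generation code: FileSection and generate_splits are hypothetical names, each section is assumed to carry min/max statistics for a single integer column, and surviving sections are simply dealt out to tasks round-robin.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FileSection:
    """Hypothetical section of an input file with min/max stats for one column."""
    section_id: int
    key_min: int
    key_max: int


def generate_splits(sections: List[FileSection], low: int, high: int,
                    num_tasks: int) -> List[List[int]]:
    """Prune sections the predicate low <= key <= high cannot match,
    then assign the remaining section ids to tasks round-robin."""
    survivors = [s for s in sections if s.key_max >= low and s.key_min <= high]
    splits: List[List[int]] = [[] for _ in range(num_tasks)]
    for i, section in enumerate(survivors):
        splits[i % num_tasks].append(section.section_id)
    return splits
```

A join whose inputs are filtered this way reads fewer sections on each side, but every surviving row still has to pass through the join itself, since the statistics cannot be used to look up an individual key.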
>
> Alan.
>
>
> *Philip Lee* <philjj...@gmail.com>
>
> February 1, 2016 at 7:27
>
> Also,
>
> when converting CSV to ORC, is an index built on every column, or is a
> single primary key created for the table?
>
> If an index is built on each column of a table, then operations that
> access any column, such as filtering, should be faster.
>
>
>
>
>
>
>
> *Philip Lee* <philjj...@gmail.com>
>
> February 1, 2016 at 7:21
>
> Hello,
>
>
>
> I am comparing the performance of some systems on ORC and CSV files.
>
> I have read the ORC documentation on the Hive website, but I am still
> curious about a few things.
>
> I know the ORC format is faster for filtering and reading because it has
> indexing.
>
> Does it also have an advantage when joining two tables stored as ORC?
>
> Could you explain this in detail?
>
> In my experiments, ORC seems to have some advantage for joins, but I am
> not sure which characteristic of ORC, compared with CSV, makes this happen.
>
>
>
> Best,
>
> Phil
>
>
>
>
