Can't resist teasing Mich about this: "Indeed one often demoralises data taking advantages of massive parallel processing in Hive."
Surely he meant denormalizes
<https://en.wikipedia.org/wiki/Denormalization>. Nobody would want to
demoralise their data -- performance would suffer. ;)

-- Lefty

On Mon, Feb 1, 2016 at 10:00 AM, Mich Talebzadeh <m...@peridale.co.uk>
wrote:

> Thanks Alan for this explanation. Interesting to see Primary Key in Hive.
>
> Sometimes a comparison is made between the Hive storage index concept in
> ORC and the Oracle Exadata storage index, which uses the same terminology!
>
> It is a bit of a misnomer to call Oracle Exadata indexes a “storage
> index”, since it appears that Exadata stores data blocks from tables in
> the storage index, usually when they are accessed via a full-table scan.
> In this context the Exadata storage index is not a “real” index, in the
> sense that it exists only in RAM and must be re-created from scratch when
> the Exadata server is bounced.
>
> Oracle Exadata and SAP HANA, as far as I know, force serial scans into
> hardware. With HANA, it is by pushing the bitmaps into the L2 cache on
> the chip. Oracle has special processors on SPARC T5 called D????
> <something> that offload the column bit scan off the CPU and onto
> separate specialized HW. As a result, both rely on massive
> parallelization.
>
> The ORC storage index is neat and different from both Exadata and SAP
> HANA. The way I see ORC storage indexes:
>
> · They are a combined index and statistics.
>
> · Each index has statistics of min, max, count, and sum for each column
>   in the row group of 10,000 rows.
>
> · Crucially, it has the location of the start of each row group, so that
>   the query can jump straight to the beginning of the row group.
>
> · The query can do a SARG pushdown that limits which rows are required
>   for the query and can avoid reading an entire file, or at least
>   sections of the file, which is by and large what a conventional RDBMS
>   B-tree index does.
>
> Cheers,
>
> Dr Mich Talebzadeh
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> *From:* Alan Gates [mailto:alanfga...@gmail.com]
> *Sent:* 01 February 2016 17:07
> *To:* user@hive.apache.org
> *Subject:* Re: ORC format
>
> ORC does not currently expose a primary key to the user, though we have
> talked of having it do that. As Mich says the indexing on ORC is oriented
> towards statistics that help the optimizer plan the query.
> This can be very important in split generation (determining which parts
> of the input will be read by which tasks) as well as in on-the-fly input
> pruning (deciding not to read a section of the file because the stats
> show that no rows in that section will match a predicate). Either of
> these can help joins. But as there is no user-visible primary key,
> there's no ability to rewrite the join as an index-based join, which I
> think is what you were asking about in your original email.
>
> Alan.
>
> *Philip Lee* <philjj...@gmail.com>
> February 1, 2016 at 7:27
>
> Also, when making ORC from CSV, is an index key made on every column, or
> is a primary key made on the table?
>
> If keys are made on each column in a table, accessing any column in
> operations like filtering should be faster.
>
> *Philip Lee* <philjj...@gmail.com>
> February 1, 2016 at 7:21
>
> Hello,
>
> I am comparing the performance of some systems on ORC and CSV files.
> I have read the ORC documentation on the Hive website, but I am still
> curious about some things.
>
> I know the ORC format is faster for filtering or reading because it has
> indexing. Does it also have an advantage when joining two tables stored
> as ORC?
>
> Could you explain it in detail?
>
> When experimenting, ORC seems to have some advantage for joins, but I am
> not sure which characteristic of ORC makes this happen compared with CSV.
>
> Best,
>
> Phil
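
The row-group statistics and predicate pushdown discussed in this thread
can be illustrated with a small, self-contained sketch. This is not the
ORC reader or any Hive API; it is a toy model in plain Python, and the
names RowGroupStats, build_index and scan_range are hypothetical, made up
for this illustration. It assumes a single integer column held in memory
and a simple range predicate, and only mimics the idea of keeping min,
max, count and sum plus a start position per row group of 10,000 rows so
that groups whose statistics rule out a match are never read.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Rows per row group; 10,000 is the figure quoted in the thread.
ROW_GROUP_SIZE = 10_000


@dataclass
class RowGroupStats:
    """Toy stand-in for one index entry (hypothetical, not an ORC type)."""
    start: int    # position of the first row of the group
    count: int    # number of rows in the group
    minimum: int  # smallest value in the group
    maximum: int  # largest value in the group
    total: int    # sum of the values in the group


def build_index(values: List[int],
                group_size: int = ROW_GROUP_SIZE) -> List[RowGroupStats]:
    """Compute min/max/count/sum and the start position for each row group."""
    index = []
    for start in range(0, len(values), group_size):
        chunk = values[start:start + group_size]
        index.append(RowGroupStats(start, len(chunk),
                                   min(chunk), max(chunk), sum(chunk)))
    return index


def scan_range(values: List[int], index: List[RowGroupStats],
               lo: int, hi: int) -> Tuple[List[int], int]:
    """Return (rows with lo <= v <= hi, number of row groups actually read)."""
    matches: List[int] = []
    groups_read = 0
    for group in index:
        # The statistics prove that no row in this group can match: skip it.
        if group.maximum < lo or group.minimum > hi:
            continue
        groups_read += 1
        # Jump straight to the start of the row group and read only its rows.
        chunk = values[group.start:group.start + group.count]
        matches.extend(v for v in chunk if lo <= v <= hi)
    return matches, groups_read


if __name__ == "__main__":
    column = list(range(100_000))  # sorted data, so group ranges do not overlap
    idx = build_index(column)
    rows, read = scan_range(column, idx, 25_000, 25_009)
    print(f"{len(rows)} matching rows, {read} of {len(idx)} row groups read")
```

In this toy model the benefit depends on how the data is laid out: with
sorted values the min/max ranges of the row groups do not overlap and most
groups are skipped, while with randomly ordered values nearly every group
may have to be read.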