I see it more as a process of innovation, and thus competition is good. Companies should not just follow these religious arguments but try for themselves what suits them. There is more to using software than the software itself ;)
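For example, a quick way to try that for yourself: write the same DataFrame in both formats and time an identical query against each. A minimal sketch for spark-shell (1.6-era DataFrame API); the input path, filter column, and output locations are invented placeholders, not anything from this thread.

val df = sqlContext.read.json("/data/logs.json")  // substitute your own source

// Same data, both formats (ORC needs a Hive-enabled Spark build).
df.write.format("orc").save("/tmp/bench_orc")
df.write.parquet("/tmp/bench_parquet")

// Crude wall-clock timing of one scan-plus-filter per format.
def timed[T](label: String)(f: => T): T = {
  val t0 = System.nanoTime()
  val result = f
  println(s"$label took ${(System.nanoTime() - t0) / 1e9} s")
  result
}

timed("orc")     { sqlContext.read.format("orc").load("/tmp/bench_orc").filter("status = 'error'").count() }
timed("parquet") { sqlContext.read.parquet("/tmp/bench_parquet").filter("status = 'error'").count() }

Crude, but it answers the question for your data and your queries rather than for anyone else's benchmark.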
> On 28 Jul 2016, at 01:44, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
> And frankly, this is becoming some sort of religious argument now.
>
> Dr Mich Talebzadeh
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>
>> On 28 July 2016 at 00:01, Sudhir Babu Pothineni <sbpothin...@gmail.com> wrote:
>> It depends on what you are doing. Here is a recent comparison of ORC and Parquet:
>>
>> https://www.slideshare.net/mobile/oom65/file-format-benchmarks-avro-json-orc-parquet
>>
>> Although it is from the ORC authors, I thought it was a fair comparison. We use ORC as the system of record on our Cloudera HDFS cluster, and our experience so far has been good.
>>
>> Parquet is backed by Cloudera, which has more Hadoop installations; ORC is backed by Hortonworks. So the battle of file formats continues...
>>
>> Sent from my iPhone
>>
>>> On Jul 27, 2016, at 4:54 PM, janardhan shetty <janardhan...@gmail.com> wrote:
>>>
>>> It seems the Parquet format is comparatively better than ORC when the dataset is log data without nested structures? Is this a fair understanding?
>>>
>>>> On Jul 27, 2016 1:30 PM, "Jörn Franke" <jornfra...@gmail.com> wrote:
>>>> Kudu has, from my impression, been designed to offer something between HBase and Parquet for write-intensive loads. It is not faster than Parquet for warehouse-type querying (if anything slower, because that is not its use case). I assume this is still its strategy.
>>>>
>>>> For some scenarios it could make sense together with Parquet and ORC. However, I am not sure what the advantage is over using HBase plus Parquet or ORC.
>>>>
>>>>> On 27 Jul 2016, at 11:47, "u...@moosheimer.com" <u...@moosheimer.com> wrote:
>>>>>
>>>>> Hi Gourav,
>>>>>
>>>>> Kudu (if you mean Apache Kudu, the Cloudera-originated project) is an in-memory DB with data storage, while Parquet is "only" a columnar storage format.
>>>>>
>>>>> As I understand it, Kudu is a BI DB meant to compete with Exasol or Hana (ok ... that's more a wish :-).
>>>>>
>>>>> Regards,
>>>>> Uwe
>>>>>
>>>>> Best regards,
>>>>> Kay-Uwe Moosheimer
>>>>>
>>>>>> On 27.07.2016 at 09:15, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>>>>>>
>>>>>> Gosh,
>>>>>>
>>>>>> whether ORC came from this or that, it runs queries in Hive with Tez at a speed that is better than Spark's.
>>>>>>
>>>>>> Has anyone heard of Kudu? It's better than Parquet. But I think someone might just start saying that Kudu has a difficult lineage as well. After all, dynastic rules dictate.
>>>>>>
>>>>>> Personally, I feel that if something stores my data compressed and lets me access it faster, I do not care where it comes from or how difficult the childbirth was :)
>>>>>>
>>>>>> Regards,
>>>>>> Gourav
>>>>>>
>>>>>>> On Tue, Jul 26, 2016 at 11:19 PM, Sudhir Babu Pothineni <sbpothin...@gmail.com> wrote:
>>>>>>> Just a correction:
>>>>>>>
>>>>>>> The ORC Java libraries from Hive have been forked into Apache ORC, with vectorization on by default.
>>>>>>>
>>>>>>> Does anyone know if Spark is leveraging this new repo yet?
>>>>>>>
>>>>>>> <dependency>
>>>>>>>   <groupId>org.apache.orc</groupId>
>>>>>>>   <artifactId>orc</artifactId>
>>>>>>>   <version>1.1.2</version>
>>>>>>>   <type>pom</type>
>>>>>>> </dependency>
>>>>>>>
>>>>>>> Sent from my iPhone
>>>>>>>
>>>>>>>> On Jul 26, 2016, at 4:50 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>>>>>>>
>>>>>>>> Parquet was inspired by Dremel but written from the ground up as a library, with support for a variety of big data systems (Hive, Pig, Impala, Cascading, etc.). It is also easy to add new support, since it is a proper library.
>>>>>>>>
>>>>>>>> ORC has been enhanced while deployed at Facebook in Hive and at Yahoo in Hive. Just Hive. It didn't really exist by itself; it was part of the big Java soup that is called Hive, without an easy way to extract it. Hive does not expose proper Java APIs. It never cared for that.
>>>>>>>>
>>>>>>>>> On Tue, Jul 26, 2016 at 9:57 AM, Ovidiu-Cristian MARCU <ovidiu-cristian.ma...@inria.fr> wrote:
>>>>>>>>> Interesting opinion, thank you.
>>>>>>>>>
>>>>>>>>> Still, according to the websites, Parquet was basically inspired by Dremel (Google) [1], and parts of ORC were enhanced while deployed at Facebook and Yahoo [2].
>>>>>>>>>
>>>>>>>>> Other than this presentation [3], do you guys know any other benchmark?
>>>>>>>>>
>>>>>>>>> [1] https://parquet.apache.org/documentation/latest/
>>>>>>>>> [2] https://orc.apache.org/docs/
>>>>>>>>> [3] http://www.slideshare.net/oom65/file-format-benchmarks-avro-json-orc-parquet
>>>>>>>>>
>>>>>>>>>> On 26 Jul 2016, at 15:19, Koert Kuipers <ko...@tresata.com> wrote:
>>>>>>>>>>
>>>>>>>>>> When Parquet came out, it was developed by a community of companies and designed as a library to be supported by multiple big data projects. Nice.
>>>>>>>>>>
>>>>>>>>>> ORC, on the other hand, initially only supported Hive. It wasn't even designed as a library that can be reused. Even today it brings in the kitchen sink of transitive dependencies. Yikes.
>>>>>>>>>>
>>>>>>>>>>> On Jul 26, 2016 5:09 AM, "Jörn Franke" <jornfra...@gmail.com> wrote:
>>>>>>>>>>> I think both are very similar, but with slightly different goals. While they work transparently for each Hadoop application, you need to enable specific support in the application for predicate push down.
>>>>>>>>>>> In the end you have to check which application you are using and do some tests (with correct predicate push down configuration). Keep in mind that both formats work best if they are sorted on filter columns (which is your responsibility) and if their optimizations are correctly configured (min/max index, bloom filter, compression, etc.).
>>>>>>>>>>>
>>>>>>>>>>> If you need to ingest sensor data, you may want to store it first in HBase and then batch-process it into large files in ORC or Parquet format.
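To make the predicate pushdown advice above concrete, this is roughly what it looks like in spark-shell. A sketch only: the paths and column names are invented, the ORC pushdown default is from memory (off in the 1.6/2.0 era), and ORC support assumes a Hive-enabled Spark build.

// Pushdown is enabled per format; Parquet's is on by default in
// recent releases, ORC's must be switched on explicitly.
sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")
sqlContext.setConf("spark.sql.orc.filterPushdown", "true")

// Sorting on the filter columns before writing keeps the built-in
// min/max statistics selective, so whole stripes/row groups can be skipped.
val sensors = sqlContext.read.json("/data/sensor_raw")   // placeholder source
sensors.sort("sensor_id", "event_time")
       .write.format("orc").save("/data/sensor_orc")

// A filter on the sort column can now prune most of the file.
sqlContext.read.format("orc").load("/data/sensor_orc")
          .filter("sensor_id = 42")
          .count()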
>>>>>>>>>>>> On 26 Jul 2016, at 04:09, janardhan shetty <janardhan...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Just wondering about the advantages and disadvantages of converting data into ORC or Parquet.
>>>>>>>>>>>>
>>>>>>>>>>>> In the Spark documentation there are numerous examples in Parquet format. Any strong reasons to choose Parquet over the ORC file format?
>>>>>>>>>>>>
>>>>>>>>>>>> Also: the current data compression is bzip2.
>>>>>>>>>>>>
>>>>>>>>>>>> http://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy
>>>>>>>>>>>>
>>>>>>>>>>>> This seems biased.
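On the bzip2 note: Spark reads .bz2 input transparently via the Hadoop codecs, and the columnar format's own codec takes over on write. A sketch assuming the raw data is JSON text; the paths and the codec choice are illustrative only.

val raw = sqlContext.read.json("/data/raw/*.json.bz2")  // bzip2 is decompressed transparently

// Parquet: codec is set via SQL conf; snappy trades some compression
// ratio for speed versus the era's gzip default.
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
raw.write.parquet("/data/parquet")

// ORC compresses with zlib by default, so no extra configuration here.
raw.write.format("orc").save("/data/orc")

Once the data is columnar, the original bzip2 choice no longer matters; compression is applied per column chunk inside the file.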