I see it more as a process of innovation, and thus competition is good. Companies should not just follow these religious arguments but try for themselves what suits them. There is more to using software than the software itself ;)
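For example, a quick way to try that for yourself: write the same DataFrame in both formats and time an identical query against each. A minimal sketch for spark-shell (1.6-era DataFrame API); the input path, filter column, and output locations are invented placeholders, not anything from this thread.

val df = sqlContext.read.json("/data/logs.json")  // substitute your own source

// Same data, both formats (ORC needs a Hive-enabled Spark build).
df.write.format("orc").save("/tmp/bench_orc")
df.write.parquet("/tmp/bench_parquet")

// Crude wall-clock timing of one scan-plus-filter per format.
def timed[T](label: String)(f: => T): T = {
  val t0 = System.nanoTime()
  val result = f
  println(s"$label took ${(System.nanoTime() - t0) / 1e9} s")
  result
}

timed("orc")     { sqlContext.read.format("orc").load("/tmp/bench_orc").filter("status = 'error'").count() }
timed("parquet") { sqlContext.read.parquet("/tmp/bench_parquet").filter("status = 'error'").count() }

Crude, but it answers the question for your data and your queries rather than for anyone else's benchmark.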
> On 28 Jul 2016, at 01:44, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
> And frankly, this is becoming some sort of religious argument now.
>
> Dr Mich Talebzadeh
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>
>> On 28 July 2016 at 00:01, Sudhir Babu Pothineni <sbpothin...@gmail.com> wrote:
>> It depends on what you are doing. Here is a recent comparison of ORC and Parquet:
>>
>> https://www.slideshare.net/mobile/oom65/file-format-benchmarks-avro-json-orc-parquet
>>
>> Although it is from the ORC authors, I thought it was a fair comparison. We use ORC as the system of record on our Cloudera HDFS cluster, and our experience so far has been good.
>>
>> Parquet is backed by Cloudera, which has more Hadoop installations; ORC is backed by Hortonworks. So the battle of file formats continues...
>>
>> Sent from my iPhone
>>
>>> On Jul 27, 2016, at 4:54 PM, janardhan shetty <janardhan...@gmail.com> wrote:
>>>
>>> It seems the Parquet format is comparatively better than ORC when the dataset is log data without nested structures? Is this a fair understanding?
>>>
>>>> On Jul 27, 2016 1:30 PM, "Jörn Franke" <jornfra...@gmail.com> wrote:
>>>> Kudu has, from my impression, been designed to offer something between HBase and Parquet for write-intensive loads. It is not faster than Parquet for warehouse-type querying (if anything slower, because that is not its use case). I assume this is still its strategy.
>>>>
>>>> For some scenarios it could make sense together with Parquet and ORC. However, I am not sure what the advantage is over using HBase plus Parquet or ORC.
>>>>
>>>>> On 27 Jul 2016, at 11:47, "u...@moosheimer.com" <u...@moosheimer.com> wrote:
>>>>>
>>>>> Hi Gourav,
>>>>>
>>>>> Kudu (if you mean Apache Kudu, the Cloudera-originated project) is an in-memory DB with data storage, while Parquet is "only" a columnar storage format.
>>>>>
>>>>> As I understand it, Kudu is a BI DB meant to compete with Exasol or Hana (ok ... that's more a wish :-).
>>>>>
>>>>> Regards,
>>>>> Uwe
>>>>>
>>>>> Best regards,
>>>>> Kay-Uwe Moosheimer
>>>>>
>>>>>> On 27.07.2016 at 09:15, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>>>>>>
>>>>>> Gosh,
>>>>>>
>>>>>> whether ORC came from this or that, it runs queries in Hive with Tez at a speed that is better than Spark's.
>>>>>>
>>>>>> Has anyone heard of Kudu? It's better than Parquet. But I think someone might just start saying that Kudu has a difficult lineage as well. After all, dynastic rules dictate.
>>>>>>
>>>>>> Personally, I feel that if something stores my data compressed and lets me access it faster, I do not care where it comes from or how difficult the childbirth was :)
>>>>>>
>>>>>> Regards,
>>>>>> Gourav
>>>>>>
>>>>>>> On Tue, Jul 26, 2016 at 11:19 PM, Sudhir Babu Pothineni <sbpothin...@gmail.com> wrote:
>>>>>>> Just a correction:
>>>>>>>
>>>>>>> The ORC Java libraries from Hive have been forked into Apache ORC, with vectorization on by default.
>>>>>>>
>>>>>>> Does anyone know if Spark is leveraging this new repo yet?
>>>>>>>
>>>>>>> <dependency>
>>>>>>>   <groupId>org.apache.orc</groupId>
>>>>>>>   <artifactId>orc</artifactId>
>>>>>>>   <version>1.1.2</version>
>>>>>>>   <type>pom</type>
>>>>>>> </dependency>
>>>>>>>
>>>>>>> Sent from my iPhone
>>>>>>>
>>>>>>>> On Jul 26, 2016, at 4:50 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>>>>>>>
>>>>>>>> Parquet was inspired by Dremel but written from the ground up as a library, with support for a variety of big data systems (Hive, Pig, Impala, Cascading, etc.). It is also easy to add new support, since it is a proper library.
>>>>>>>>
>>>>>>>> ORC has been enhanced while deployed at Facebook in Hive and at Yahoo in Hive. Just Hive. It didn't really exist by itself; it was part of the big Java soup that is called Hive, without an easy way to extract it. Hive does not expose proper Java APIs. It never cared for that.
>>>>>>>>
>>>>>>>>> On Tue, Jul 26, 2016 at 9:57 AM, Ovidiu-Cristian MARCU <ovidiu-cristian.ma...@inria.fr> wrote:
>>>>>>>>> Interesting opinion, thank you.
>>>>>>>>>
>>>>>>>>> Still, according to the websites, Parquet was basically inspired by Dremel (Google) [1], and parts of ORC were enhanced while deployed at Facebook and Yahoo [2].
>>>>>>>>>
>>>>>>>>> Other than this presentation [3], do you guys know any other benchmark?
>>>>>>>>>
>>>>>>>>> [1] https://parquet.apache.org/documentation/latest/
>>>>>>>>> [2] https://orc.apache.org/docs/
>>>>>>>>> [3] http://www.slideshare.net/oom65/file-format-benchmarks-avro-json-orc-parquet
>>>>>>>>>
>>>>>>>>>> On 26 Jul 2016, at 15:19, Koert Kuipers <ko...@tresata.com> wrote:
>>>>>>>>>>
>>>>>>>>>> When Parquet came out, it was developed by a community of companies and designed as a library to be supported by multiple big data projects. Nice.
>>>>>>>>>>
>>>>>>>>>> ORC, on the other hand, initially only supported Hive. It wasn't even designed as a library that can be reused. Even today it brings in the kitchen sink of transitive dependencies. Yikes.
>>>>>>>>>>
>>>>>>>>>>> On Jul 26, 2016 5:09 AM, "Jörn Franke" <jornfra...@gmail.com> wrote:
>>>>>>>>>>> I think both are very similar, but with slightly different goals. While they work transparently for each Hadoop application, you need to enable specific support in the application for predicate push down.
>>>>>>>>>>> In the end you have to check which application you are using and do some tests (with correct predicate push down configuration). Keep in mind that both formats work best if they are sorted on filter columns (which is your responsibility) and if their optimizations are correctly configured (min/max index, bloom filter, compression, etc.).
>>>>>>>>>>>
>>>>>>>>>>> If you need to ingest sensor data, you may want to store it first in HBase and then batch-process it into large files in ORC or Parquet format.
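To make the predicate pushdown advice above concrete, this is roughly what it looks like in spark-shell. A sketch only: the paths and column names are invented, the ORC pushdown default is from memory (off in the 1.6/2.0 era), and ORC support assumes a Hive-enabled Spark build.

// Pushdown is enabled per format; Parquet's is on by default in
// recent releases, ORC's must be switched on explicitly.
sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")
sqlContext.setConf("spark.sql.orc.filterPushdown", "true")

// Sorting on the filter columns before writing keeps the built-in
// min/max statistics selective, so whole stripes/row groups can be skipped.
val sensors = sqlContext.read.json("/data/sensor_raw")   // placeholder source
sensors.sort("sensor_id", "event_time")
       .write.format("orc").save("/data/sensor_orc")

// A filter on the sort column can now prune most of the file.
sqlContext.read.format("orc").load("/data/sensor_orc")
          .filter("sensor_id = 42")
          .count()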
>>>>>>>>>>>> On 26 Jul 2016, at 04:09, janardhan shetty <janardhan...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Just wondering about the advantages and disadvantages of converting data into ORC or Parquet.
>>>>>>>>>>>>
>>>>>>>>>>>> In the Spark documentation there are numerous examples in Parquet format. Any strong reasons to choose Parquet over the ORC file format?
>>>>>>>>>>>>
>>>>>>>>>>>> Also: the current data compression is bzip2.
>>>>>>>>>>>>
>>>>>>>>>>>> http://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy
>>>>>>>>>>>>
>>>>>>>>>>>> This seems biased.
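On the bzip2 note: Spark reads .bz2 input transparently via the Hadoop codecs, and the columnar format's own codec takes over on write. A sketch assuming the raw data is JSON text; the paths and the codec choice are illustrative only.

val raw = sqlContext.read.json("/data/raw/*.json.bz2")  // bzip2 is decompressed transparently

// Parquet: codec is set via SQL conf; snappy trades some compression
// ratio for speed versus the era's gzip default.
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
raw.write.parquet("/data/parquet")

// ORC compresses with zlib by default, so no extra configuration here.
raw.write.format("orc").save("/data/orc")

Once the data is columnar, the original bzip2 choice no longer matters; compression is applied per column chunk inside the file.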