Re:Re: ORC v/s Parquet for Spark 2.0

prosp4300 Wed, 27 Jul 2016 00:35:01 -0700

Thanks for this immediate correction :)


On 2016-07-27 15:17:54, "Gourav Sengupta" <gourav.sengu...@gmail.com> wrote:

Sorry, 


In my email above I was referring to KUDU, and there it goes: how can KUDU be 
right if it is first mentioned in forums with the wrong spelling? It has had a 
difficult beginning, with people trying to figure out its name.




Regards,
Gourav Sengupta


On Wed, Jul 27, 2016 at 8:15 AM, Gourav Sengupta <gourav.sengu...@gmail.com> 
wrote:

Gosh,


Whether ORC came from this or that, it runs queries in HIVE with TEZ faster 
than SPARK does.


Has anyone heard of KUDA? It's better than Parquet. But I think that someone 
might just start saying that KUDA has a difficult lineage as well. After all, 
dynastic rules dictate.


Personally, I feel that if something stores my data compressed and lets me 
access it faster, I do not care where it comes from or how difficult the 
childbirth was :)




Regards,
Gourav


On Tue, Jul 26, 2016 at 11:19 PM, Sudhir Babu Pothineni <sbpothin...@gmail.com> 
wrote:

Just a correction:


The ORC Java libraries from Hive have been forked into Apache ORC. 
Vectorization is the default. 


I do not know whether Spark is leveraging this new repo.


<dependency>
    <groupId>org.apache.orc</groupId>
    <artifactId>orc</artifactId>
    <version>1.1.2</version>
    <type>pom</type>
</dependency>













Sent from my iPhone
On Jul 26, 2016, at 4:50 PM, Koert Kuipers <ko...@tresata.com> wrote:


parquet was inspired by dremel but written from the ground up as a library with 
support for a variety of big data systems (hive, pig, impala, cascading, etc.). 
it is also easy to add support for new systems, since it's a proper library.


orc has been enhanced while deployed at facebook in hive and at yahoo in hive. 
just hive. it didn't really exist by itself. it was part of the big java soup 
that is called hive, without an easy way to extract it. hive does not expose 
proper java apis. it never cared for that.



On Tue, Jul 26, 2016 at 9:57 AM, Ovidiu-Cristian MARCU 
<ovidiu-cristian.ma...@inria.fr> wrote:

Interesting opinion, thank you


Still, according to the project websites, Parquet was inspired by Dremel 
(Google) [1], and part of ORC was enhanced while deployed at Facebook and 
Yahoo [2].


Other than this presentation [3], do you guys know of any other benchmarks?


[1] https://parquet.apache.org/documentation/latest/
[2] https://orc.apache.org/docs/
[3] http://www.slideshare.net/oom65/file-format-benchmarks-avro-json-orc-parquet


On 26 Jul 2016, at 15:19, Koert Kuipers <ko...@tresata.com> wrote:



when parquet came out it was developed by a community of companies, and was 
designed as a library to be supported by multiple big data projects. nice

orc on the other hand initially only supported hive. it wasn't even designed as 
a library that can be re-used. even today it brings in the kitchen sink of 
transitive dependencies. yikes



On Jul 26, 2016 5:09 AM, "Jörn Franke" <jornfra...@gmail.com> wrote:

I think both are very similar, but with slightly different goals. While they 
work transparently for each Hadoop application, you need to enable specific 
support in the application for predicate pushdown. 
In the end you have to check which application you are using and do some tests 
(with the correct predicate pushdown configuration). Keep in mind that both 
formats work best if the data is sorted on the filter columns (which is your 
responsibility) and if their optimizations are correctly configured (min/max 
index, bloom filter, compression, etc.).
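
To make the point about sorting and min/max indexes concrete, here is a minimal pure-Python sketch. It is not ORC or Parquet code and does not use either library: `build_stripes` and `scan_equal` are hypothetical helpers, and the stripe size of 1000 rows and the random test data are arbitrary assumptions. It only mimics the idea that a reader can skip a chunk of rows when the chunk's recorded min/max range cannot contain the filter value.

```python
import random


def build_stripes(values, stripe_size):
    """Split values into fixed-size chunks and record min/max per chunk.

    This loosely imitates per-stripe / per-row-group statistics; the real
    formats store far more metadata than this.
    """
    stripes = []
    for start in range(0, len(values), stripe_size):
        chunk = values[start:start + stripe_size]
        stripes.append({"min": min(chunk), "max": max(chunk), "rows": chunk})
    return stripes


def scan_equal(stripes, target):
    """Evaluate the filter `value == target`.

    Returns (matching_rows, stripes_read). A stripe is skipped without
    touching its rows when target lies outside its [min, max] range.
    """
    matches = []
    stripes_read = 0
    for stripe in stripes:
        if target < stripe["min"] or target > stripe["max"]:
            continue  # pruned using the statistics alone
        stripes_read += 1
        matches.extend(v for v in stripe["rows"] if v == target)
    return matches, stripes_read


random.seed(0)  # fixed seed so repeated runs use the same data
data = [random.randrange(1000) for _ in range(10_000)]

unsorted_stripes = build_stripes(data, 1000)
sorted_stripes = build_stripes(sorted(data), 1000)

hits_unsorted, read_unsorted = scan_equal(unsorted_stripes, 500)
hits_sorted, read_sorted = scan_equal(sorted_stripes, 500)

print("stripes read, unsorted input:", read_unsorted)
print("stripes read, sorted input:  ", read_sorted)
```

Both scans return the same matching rows. On the sorted input the matching values sit next to each other, so at most two stripes have a range that can contain the target. On the unsorted input almost every stripe's range is likely to span the target, so few or no stripes can be skipped. That is why sorting on the filter columns matters before pushdown can pay off.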


If you need to ingest sensor data, you may want to store it first in HBase and 
then batch-process it into large files in ORC or Parquet format.

On 26 Jul 2016, at 04:09, janardhan shetty <janardhan...@gmail.com> wrote:


Just wondering about the advantages and disadvantages of converting data into 
ORC or Parquet.


In the documentation of Spark there are numerous examples of Parquet format.



Any strong reasons to choose Parquet over the ORC file format?


Also: the current data compression is bzip2.



http://stackoverflow.com/questions/32373460/parquet-vs-orc-vs-orc-with-snappy
This seems biased.
