Re: Format dillema

Edward Capriolo Tue, 20 Jun 2017 11:42:43 -0700

"Hive and LLAP do support Parquet precisely because the developers want to
be able to process everyone's data."

Yes. But there are a number of optimizations on the Hive ORC side that we
know are not implemented in the Parquet support, which is why I made my
statement: Impala (Parquet=yes, ORC=no), Hive (ORC=yes, Parquet=lame). E.g.:

https://hortonworks.com/blog/orcfile-in-hdp-2-better-compression-better-performance/

This requires a reader that is smart enough to understand the predicates.
Fortunately ORC has had the corresponding improvements to allow predicates
to be pushed into it, and takes advantage of its inline indexes to deliver
performance benefits.
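
To make that concrete, here is a toy sketch of what "pushing a predicate into
the reader" buys you. It is not ORC's or Hive's actual code; RowGroup and
scan_greater_than are made-up names, and the only assumption is that the file
keeps a min/max per group of rows, which the reader checks before touching
the rows themselves.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RowGroup:
    """Hypothetical stand-in for a stripe / row group with an inline index."""
    rows: List[int]
    min_value: int
    max_value: int

    @classmethod
    def from_rows(cls, rows: List[int]) -> "RowGroup":
        # The writer computes the statistics once, at write time.
        return cls(rows=rows, min_value=min(rows), max_value=max(rows))


def scan_greater_than(groups: List[RowGroup], threshold: int) -> Tuple[List[int], int]:
    """Return the values > threshold and the number of groups actually read."""
    matches: List[int] = []
    groups_read = 0
    for group in groups:
        # Pushed-down predicate: if even the max cannot match, skip the group
        # without decoding any of its rows.
        if group.max_value <= threshold:
            continue
        groups_read += 1
        matches.extend(v for v in group.rows if v > threshold)
    return matches, groups_read


groups = [
    RowGroup.from_rows([1, 5, 9]),
    RowGroup.from_rows([10, 20, 30]),
    RowGroup.from_rows([40, 50, 60]),
]
values, groups_read = scan_greater_than(groups, 25)
print(values, groups_read)
```

A reader that does not understand the predicate has to decode all three
groups and let the engine filter afterwards, which is the gap I am talking
about.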

I.e., universal improvements won't happen.

"Part of having a thriving ecosystem is that there are competitors, which
creates some user confusion, but makes the ecosystem stronger. "

True in many cases. But the fork-happy not-invented-here-ness is too much.
To the average user:
1) both do the same thing.
2) each vendor has some white paper or PowerPoint selling you on why their
solution is naturally better/smaller/faster.

As it relates to the columnar formats, it is a silly arms race. Parquet had
C/C++ right off the bat, of course, because Impala has to work in C/C++. But
hey, maybe 2.3 years later someone has a GitHub repo that does that for ORC, and
maybe 3.2 years later someone adds predicate pushdown in Hive for Parquet.

In the meantime actual users are stuck in the middle. They either:
1) use text files anyway, because text is the ONLY format all tools support, or
2) make two outputs for each query, using 2x the space.

"(Can someone please make a competitor for Oozie? *grin*)"
https://github.com/apache/incubator-airflow , mrjob, Luigi, Azkaban :)

On Tue, Jun 20, 2017 at 1:45 PM, Owen O'Malley <owen.omal...@gmail.com>
wrote:

>
>
> On Tue, Jun 20, 2017 at 10:12 AM, Edward Capriolo <edlinuxg...@gmail.com>
> wrote:
>
>> It is whack that two optimized row columnar formats exists and each
>> respective project (hive/impala) has good support for one and lame/no
>> support for the other.
>>
>
> We have two similar formats because they were designed at roughly the same
> time by different teams with similar, but not identical goals. Part of
> having a thriving ecosystem is that there are competitors, which creates
> some user confusion, but makes the ecosystem stronger. (Can someone please
> make a competitor for Oozie? *grin*)
>
> Hive and LLAP do support Parquet precisely because the developers want to
> be able to process everyone's data. The Impala project is free to make
> their own choices about what to work on.
>
> .. Owen
>
>
