"Hive and LLAP do support Parquet precisely because the developers want to be able to process everyone's data."
Yes. But there are a number of optimizations on the Hive ORC side that we know are not implemented in the Parquet support, which is why I made my statement: Impala (Parquet=yes, ORC=no), Hive (ORC=yes, Parquet=lame).

E.g. https://hortonworks.com/blog/orcfile-in-hdp-2-better-compression-better-performance/

This requires a reader that is smart enough to understand the predicates. Fortunately ORC has had the corresponding improvements to allow predicates to be pushed into it, and takes advantage of its inline indexes to deliver performance benefits.

I.e., universal improvements won't happen.

"Part of having a thriving ecosystem is that there are competitors, which creates some user confusion, but makes the ecosystem stronger."

True in many cases. But the fork-happy not-invented-here-ness is too much. To the average user:
1) both do the same thing.
2) each vendor has some white paper/PowerPoint selling you on why their solution is naturally better/smaller/faster.

As it relates to the columnar formats, it is a silly arms race. Parquet had C/C++ right off the bat, of course, because Impala has to work in C/C++. But hey, maybe 2.3 years later someone has a GitHub repo that does that for ORC, and maybe 3.2 years later someone adds predicate pushdown for Parquet in Hive.

In the meantime actual users are stuck in the middle:
1) use text files anyway because it is the ONLY format all tools support
2) make two outputs for each query, using 2x the space

(Can someone please make a competitor for Oozie? *grin*)

https://github.com/apache/incubator-airflow , mrjobs, luigi, azkaban :)

On Tue, Jun 20, 2017 at 1:45 PM, Owen O'Malley <owen.omal...@gmail.com> wrote:

>
>
> On Tue, Jun 20, 2017 at 10:12 AM, Edward Capriolo <edlinuxg...@gmail.com>
> wrote:
>
>> It is whack that two optimized row columnar formats exists and each
>> respective project (hive/impala) has good support for one and lame/no
>> support for the other.
>>
>
> We have two similar formats because they were designed at roughly the same
> time by different teams with similar, but not identical goals. Part of
> having a thriving ecosystem is that there are competitors, which creates
> some user confusion, but makes the ecosystem stronger. (Can someone please
> make a competitor for Oozie? *grin*)
>
> Hive and LLAP do support Parquet precisely because the developers want to
> be able to process everyone's data. The Impala project is free to make
> their own choices about what to work on.
>
> .. Owen
>