"Hive 3.x branch has text vectorization and LLAP cache support for it, so
hopefully the only relevant concern about Text will be the storage costs
due to poor compression (& the lack of updates)."

I kept hearing about vectorization, but later found out it was only going to
work if I used ORC. Literally years have come and gone, and now we are
talking like 3.x is going to vectorize text. I get that LazySimpleSerDe is in
many ways the polar opposite of a batched approach that attempts to pull
down N thousand rows and process them 'in a batch'. I also get the dynamics
of the situation - people ultimately work on what they want, etc.
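
For anyone who wants to sanity-check this on their own cluster, here is a
minimal sketch - the table name is invented, and the second flag is my best
guess at the switch that gates text/LazySimpleSerDe input on recent branches:

    -- Hedged sketch: my_text_table is made up; the second property is my
    -- assumption about what enables vectorized reads over text input.
    SET hive.vectorized.execution.enabled=true;
    SET hive.vectorized.use.vector.serde.deserialize=true;

    -- EXPLAIN VECTORIZATION (Hive 2.3+) reports whether each operator
    -- vectorized, and the reason when it did not.
    EXPLAIN VECTORIZATION SUMMARY
    SELECT count(*) FROM my_text_table WHERE col1 = 'x';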

You're going to laugh, but from my personal experience in a number of
environments, from small to mid-sized, I have actually just had the best
luck with gzipped text files. When ORC was still a twinkle in someone's eye,
I was playing with the original RCFiles!
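
For context, the setup I mean is nothing fancier than this (table, columns,
and location are invented):

    -- Plain gzipped text: Hive decompresses *.gz files in the location
    -- transparently, and every other tool in the stack can read them too.
    CREATE EXTERNAL TABLE events_raw (
      event_time STRING,
      user_id    BIGINT,
      payload    STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE
    LOCATION 's3a://my-bucket/events/';

The obvious trade-off is that gzip is not splittable, so each .gz file is
read by a single task; you just have to keep the files reasonably sized.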

"The start of this thread is the exact opposite - trying to suggest ORC is
better for storage & wanting to use it."

Right, I am not trying to say that any one format is better than the other
on a case-by-case basis. I'm happy that we have something better than
RCFile, but I am generally trying to avoid Hive becoming the quasi-ORC
datastore, where some non-negligible part of the features ONLY work with ORC.

https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest

"Currently only ORC is supported for the format of the destination table."

Say what? I can do "INSERT INTO TABLE AVROTABLE SELECT * FROM JSON_TABLE",
but somehow I can ONLY stream ingest into a table if it is this one type.
If only a given feature were supported by more than one format (what a
world that would be)!
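
To make the constraint concrete, per that wiki page the destination table
has to look roughly like this (names are invented; ORC plus bucketing plus
transactional=true are the documented requirements):

    -- The only shape of table the streaming ingest API will accept today:
    -- stored as ORC, clustered into buckets, and flagged transactional.
    CREATE TABLE web_events (
      event_time STRING,
      user_id    BIGINT,
      payload    STRING
    )
    CLUSTERED BY (user_id) INTO 4 BUCKETS
    STORED AS ORC
    TBLPROPERTIES ('transactional' = 'true');

Swap STORED AS ORC for STORED AS AVRO and, as far as I can tell, the
streaming connection is rejected outright - even though a batch INSERT into
that same Avro table works fine.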

On Tue, Jun 20, 2017 at 5:05 PM, Gopal Vijayaraghavan <gop...@apache.org>
wrote:

>
> > 1) both do the same thing.
>
> The start of this thread is the exact opposite - trying to suggest ORC is
> better for storage & wanting to use it.
>
> > As it relates to the columnar formats, it is a silly arms race.
>
> I'm not sure "silly" is the operative word - we've lost a lot of the
> fragmentation in the community and are down to 2 good choices, neither of
> them wrong.
>
> Impala's original format was Trevni, which lives on in Avro docs. And
> there was RCFile - a sequence file format, which stored columnar data in a
> <K,V> pair. And then there was LazySimple SequenceFile, LazyBinary
> SequenceFile, Avro and Text with many SerDes.
>
> Purely speculatively, we're headed into more fragmentation again, with
> people rediscovering that they need updates.
>
> Uber's Hoodie is the Parquet fork, but for Spark, not Impala - while ORC
> ACID is getting much easier to update with MERGE statements and a
> deadlock-aware txn manager.
>
> > Parquet had C/C++ right off the bat of course because Impala has to work
> > in C/C++.
>
> I think that is the primary reason why the Java Parquet readers are still
> way behind in performance.
>
> Nobody sane wants to work on performance tuning a data reader library in
> Java, when it is so much easier to do it in C++.
>
> Doing C++ after tuning the format for optimal performance in Java 8 makes
> a lot of sense, in hindsight. The marshmallow test is easier if you can't
> have a marshmallow now.
>
> > 1) uses text file anyway because it is the ONLY format all tools support
>
> I see this often - folks who just throw plain text into S3 and query it.
>
> Hive 3.x branch has text vectorization and LLAP cache support for it, so
> hopefully the only relevant concern about Text will be the storage costs
> due to poor compression (& the lack of updates).
>
> Cheers,
> Gopal
>
>
>
