I was under the impression that ORC files with Snappy compression would
prove to be better unless your processing was columnar in nature.
Isn't that the case?
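For anyone comparing, here is a minimal sketch of writing ORC with Snappy from Pig, assuming Pig 0.14+ where OrcStorage is built in (the paths and relation names are hypothetical, and I haven't benchmarked this):

```pig
-- Sketch, assuming Pig 0.14+ built-in OrcStorage; paths are hypothetical.
data = LOAD 'input/events' USING org.apache.parquet.pig.ParquetLoader();

-- '-c SNAPPY' asks the ORC writer for Snappy compression
-- (other documented values include NONE and ZLIB).
STORE data INTO 'output/events_orc' USING OrcStorage('-c SNAPPY');
```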
On Thu, Feb 7, 2019, 21:54 Russell Jurney wrote:
> Sorry if this isn't helpful, but the other obvious thing is to store intermediate
You might need https://issues.apache.org/jira/browse/PIG-4092
Sorry if this isn't helpful, but the other obvious thing is to store
intermediate data in Parquet whenever you repeat code/data that can be
shared between jobs, if tests indicate it is faster. Before Parquet this
wasn't necessarily advantageous, as IO from disk is slower than IO through
RAM.
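To illustrate the suggestion, a sketch of materializing a shared intermediate once and reusing it from later jobs (the paths, field names, and the FILTER step are hypothetical, and whether this wins depends on your own tests):

```pig
-- Sketch: persist a shared intermediate result once, reuse it in later jobs.
-- Paths and fields are hypothetical.
raw     = LOAD 'logs/2019-02-07' USING org.apache.parquet.pig.ParquetLoader();
cleaned = FILTER raw BY status == 200;  -- hypothetical shared prep step
STORE cleaned INTO 'tmp/cleaned' USING org.apache.parquet.pig.ParquetStorer();

-- A later job loads the materialized intermediate instead of recomputing it.
shared = LOAD 'tmp/cleaned' USING org.apache.parquet.pig.ParquetLoader();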
Indeed. When loading Parquet using org.apache.parquet.pig.ParquetLoader(),
we're specifying a schema listing just the columns we want to load.
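For the archive, this is what that projection looks like: ParquetLoader accepts a requested-schema string in its constructor, so only the named columns are read. A sketch with hypothetical path and field names:

```pig
-- Sketch: the constructor argument is a requested schema, so ParquetLoader
-- reads only these columns from the files. Path and fields are hypothetical.
events = LOAD 'data/events'
    USING org.apache.parquet.pig.ParquetLoader('user_id:chararray, ts:long');
```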
On 2/7/19, 5:14 PM, "Russell Jurney" wrote:
Well, the obvious thing is to load only those columns you need. Just in
case you’re not doing this.
On Thu, Feb 7, 2019 at 2:04 PM Michael Doo wrote:
> Hey all,
> I’ve been migrating some processes over from ingesting Avro to ingesting
> Parquet. In Spark, we’re seeing 2x-8x performance gains
Hey all,
I’ve been migrating some processes over from ingesting Avro to ingesting
Parquet. In Spark, we’re seeing 2x-8x performance gains when using Parquet over
Avro. In Pig, similar processes take about the same runtime with either
format (and sometimes even longer using Parquet). We’ve