Re: Avro vs Parquet performance on Pig

2019-02-15 Thread Mario Ferreira
I was under the impression that ORC files with Snappy compression would prove to be better unless your processing was columnar in nature. Isn't that the case?
On Thu, Feb 7, 2019, 21:54 Russell Jurney wrote:
> Sorry if this isn't helpful, but the other obvious thing is to store
> intermediate

Re: Avro vs Parquet performance on Pig

2019-02-11 Thread Rohini Palaniswamy
You might need https://issues.apache.org/jira/browse/PIG-4092
On Thu, Feb 7, 2019 at 3:54 PM Russell Jurney wrote:
> Sorry if this isn't helpful, but the other obvious thing is to store
> intermediate data in Parquet whenever you repeat code/data that can be
> shared between jobs. If tests

Re: Avro vs Parquet performance on Pig

2019-02-07 Thread Russell Jurney
Sorry if this isn't helpful, but the other obvious thing is to store intermediate data in Parquet whenever you repeat code/data that can be shared between jobs, provided tests indicate it is faster. Before Parquet this wasn't necessarily advantageous, as IO from disk is slower than IO through RAM which

Re: Avro vs Parquet performance on Pig

2019-02-07 Thread Michael Doo
Indeed. When loading Parquet using org.apache.parquet.pig.ParquetLoader(), we pass a schema that names only the columns we want to load.
On 2/7/19, 5:14 PM, "Russell Jurney" wrote:
    Well, the obvious thing is to load only those columns you need. Just in case you’re not doing this.
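For readers following the thread, a minimal sketch of what such a column-restricted load might look like in Pig Latin. The jar name, input path, and column names are hypothetical and not taken from the thread. The only thing assumed about the library is that ParquetLoader accepts a requested-schema string as its constructor argument.

```pig
-- Hypothetical jar location; adjust to wherever parquet-pig is installed.
REGISTER parquet-pig-bundle.jar;

-- Hypothetical path and columns. Passing a schema string asks the loader
-- to read only the listed columns rather than every column in the file.
events = LOAD '/data/events.parquet'
         USING org.apache.parquet.pig.ParquetLoader('user_id:chararray, event_time:long');

-- Downstream statements then see only the two requested fields.
by_user = GROUP events BY user_id;
```

Whether this projection actually cuts runtime depends on the job, so it is worth timing against a load of the full schema.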

Re: Avro vs Parquet performance on Pig

2019-02-07 Thread Russell Jurney
Well, the obvious thing is to load only those columns you need. Just in case you’re not doing this.
On Thu, Feb 7, 2019 at 2:04 PM Michael Doo wrote:
> Hey all,
> I’ve been migrating some processes over from ingesting Avro to ingesting
> Parquet. In Spark, we’re seeing 2x-8x performance gains

Avro vs Parquet performance on Pig

2019-02-07 Thread Michael Doo
Hey all, I’ve been migrating some processes over from ingesting Avro to ingesting Parquet. In Spark, we’re seeing 2x-8x performance gains when using Parquet over Avro. In Pig, similar processes take about the same time with either format (and sometimes even longer with Parquet). We’ve