Re: Avro vs Parquet performance on Pig

Michael Doo Thu, 07 Feb 2019 15:30:07 -0800

Indeed. When loading Parquet with org.apache.parquet.pig.ParquetLoader(),
we specify a schema listing only the columns we want to load.
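
For reference, a minimal sketch of that kind of load. The path, relation
names, and field names below are hypothetical placeholders, not our actual
job; the only point is that the schema string passed to the loader names
the subset of columns to read.

```pig
-- Hypothetical path and field names, for illustration only.
-- The schema string asks ParquetLoader for just these two columns,
-- rather than every column stored in the file.
events = LOAD '/data/events.parquet'
    USING org.apache.parquet.pig.ParquetLoader('user_id:chararray, ts:long');

-- Downstream steps then work on the projected columns only.
recent = FILTER events BY ts > 1549000000L;
DUMP recent;
```

If the requested fields do not line up with the file schema, the result may
differ from what this sketch suggests, so treat it as a starting point only.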


On 2/7/19, 5:14 PM, "Russell Jurney" <[email protected]> wrote:

    Well, the obvious thing is to load only those columns you need. Just in
    case you’re not doing this.
    
    On Thu, Feb 7, 2019 at 2:04 PM Michael Doo <[email protected]> wrote:
    
    > Hey all,
    > I’ve been migrating some processes over from ingesting Avro to ingesting
    > Parquet. In Spark, we’re seeing 2x-8x performance gains when using Parquet
    > over Avro. In Pig, similar processes are about the same runtime between the
    > two formats (and sometimes even higher using Parquet). We’ve enabled
    > dictionary filtering as well as predicate filter/pushdown. Wondering if
    > there are other settings / strategies we might be missing to take advantage
    > of Parquet.
    >
    > Thanks,
    > Michael
    >
    -- 
    Russell Jurney @rjurney <http://twitter.com/rjurney>
    [email protected] LI <http://linkedin.com/in/russelljurney> FB
    <http://facebook.com/jurney> datasyndrome.com
