Eyal,
The Parquet Pig loader is fine if all the data is present in the files, but if
I've written out from Spark using `df.write.partitionBy('colA',
'colB').parquet('s3://path/to/output')`, the values of those two columns are
encoded in the output path and removed from the data files themselves:
s3://path/to/output/colA=valA/colB=valB/part-0001.parquet. There are hacky
workarounds, such as duplicating the columns in Spark before writing, which fix
loading into Pig but mean the columns appear twice when you read the data back
into Spark.
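For illustration, the duplicate-column workaround looks roughly like this
(column names here are placeholders, not from the real dataset):

    # Copy the partition columns before writing so their values survive inside
    # the Parquet files as well as in the directory names.
    from pyspark.sql import functions as F

    df_out = (df
              .withColumn('colA_dup', F.col('colA'))
              .withColumn('colB_dup', F.col('colB')))

    # partitionBy still moves colA/colB into the path, but colA_dup/colB_dup
    # remain in the data and are visible to the Parquet Pig loader.
    df_out.write.partitionBy('colA', 'colB').parquet('s3://path/to/output')
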
Best,
Michael
On 8/30/18, 10:15 AM, "Adam Szita" <[email protected]> wrote:
Hi Eyal,
For just loading Parquet files the Parquet Pig loader is okay, although I
don't think it lets you use partition values in the dataset later.
I know the plain old PigStorage has a trick with the -tagFile / -tagPath
options, but I'm not sure that would be enough in Michael's case, or whether
the Parquet loader supports anything similar.
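For reference, with delimited text (not Parquet) that trick looks roughly like
this; the delimiter and path are just illustrative:

    -- '-tagPath' prepends the full file path (including the key=value
    -- directory segments) as the first field of each tuple; '-tagFile'
    -- prepends only the file name.
    raw = LOAD 's3://path/to/files' USING PigStorage(',', '-tagPath');
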
Thanks
On Thu, 30 Aug 2018 at 16:10, Eyal Allweil <[email protected]>
wrote:
> Hi Michael,
> You can also use the Parquet Pig loader (especially if you're not working
> with Hive). Here's a link to the Maven repository for it.
>
> https://mvnrepository.com/artifact/org.apache.parquet/parquet-pig/1.10.0
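> For example, after registering the bundle jar (name and version below are
> just illustrative), loading looks roughly like this:
>
>     REGISTER parquet-pig-bundle-1.10.0.jar;
>     data = LOAD 's3://path/to/files'
>            USING org.apache.parquet.pig.ParquetLoader();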
> Regards,
> Eyal
>
> On Tuesday, August 28, 2018, 2:40:36 PM GMT+3, Adam Szita
> <[email protected]> wrote:
>
> Hi Michael,
>
> Yes, you can use HCatLoader to do this.
> The requirement is that you have a Hive table defined on top of your data
> (probably pointing to s3://path/to/files), with the Hive MetaStore holding
> all the relevant meta/schema information.
> If you do not have a Hive table yet, you can define one in Hive by manually
> specifying the schema, and after that partitions can be added automatically
> via Hive's 'msck repair' command.
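> Roughly, the sequence is (table and column names below are placeholders):
>
>     -- In Hive, run once:
>     --   CREATE EXTERNAL TABLE my_table (id STRING, val DOUBLE)
>     --     PARTITIONED BY (some_flag BOOLEAN)
>     --     STORED AS PARQUET
>     --     LOCATION 's3://path/to/files';
>     --   MSCK REPAIR TABLE my_table;
>
>     -- In Pig, the partition column some_flag then shows up as a normal field:
>     data = LOAD 'default.my_table'
>            USING org.apache.hive.hcatalog.pig.HCatLoader();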
>
> Hope this helps,
> Adam
>
>
> On Mon, 27 Aug 2018 at 19:18, Michael Doo <[email protected]> wrote:
>
> > Hello,
> >
> > I’m trying to read partitioned Parquet data into Pig (so it’s stored in
> > S3 like
> > s3://path/to/files/some_flag=true/part-00095-a2a6230b-9750-48e4-9cd0-b553ffc220de.c000.gz.parquet).
> > I’d like to load it into Pig and add the partitions as columns. I’ve read
> > some resources suggesting using the HCatLoader, but so far haven’t had
> > success.
> >
> > Any advice would be welcome.
> >
> > ~ Michael
> >