Eyal,
The Parquet Pig loader is fine if all the data is present in the files, but if
I've written out from Spark using `df.write.partitionBy('colA',
'colB').parquet('s3://path/to/output')`, the values of those two columns are
encoded in the output path and removed from the data files themselves:
s3://path/to/output/colA=valA/colB=valB/part-0001.parquet. There are hacky
workarounds, such as duplicating the columns in Spark before writing, which fix
loading into Pig but mean the columns appear twice when you read the data back
into Spark.
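For illustration, the duplicate-column workaround looks roughly like this
(column names here are placeholders, not from the real dataset):

    # Copy the partition columns before writing so their values survive inside
    # the Parquet files as well as in the directory names.
    from pyspark.sql import functions as F

    df_out = (df
              .withColumn('colA_dup', F.col('colA'))
              .withColumn('colB_dup', F.col('colB')))

    # partitionBy still moves colA/colB into the path, but colA_dup/colB_dup
    # remain in the data and are visible to the Parquet Pig loader.
    df_out.write.partitionBy('colA', 'colB').parquet('s3://path/to/output')
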
Best,
Michael
On 8/30/18, 10:15 AM, "Adam Szita" <[email protected]> wrote:
Hi Eyal,
For just loading Parquet files the Parquet Pig loader is okay, although I
don't think it lets you use partition values in the dataset later.
I know the plain old PigStorage has a trick with the -tagFile / -tagPath
options, but I'm not sure that would be enough in Michael's case, or whether
the Parquet loader supports anything similar.
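For reference, with delimited text (not Parquet) that trick looks roughly like
this; the delimiter and path are just illustrative:

    -- '-tagPath' prepends the full file path (including the key=value
    -- directory segments) as the first field of each tuple; '-tagFile'
    -- prepends only the file name.
    raw = LOAD 's3://path/to/files' USING PigStorage(',', '-tagPath');
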
Thanks
On Thu, 30 Aug 2018 at 16:10, Eyal Allweil <[email protected]>
wrote:
> Hi Michael,
> You can also use the Parquet Pig loader (especially if you're not working
> with Hive). Here's a link to the Maven repository for it.
>
> https://mvnrepository.com/artifact/org.apache.parquet/parquet-pig/1.10.0
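> For example, after registering the bundle jar (name and version below are
> just illustrative), loading looks roughly like this:
>
>     REGISTER parquet-pig-bundle-1.10.0.jar;
>     data = LOAD 's3://path/to/files'
>            USING org.apache.parquet.pig.ParquetLoader();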
> Regards,
> Eyal
>
> On Tuesday, August 28, 2018, 2:40:36 PM GMT+3, Adam Szita
> <[email protected]> wrote:
>
> Hi Michael,
>
> Yes, you can use HCatLoader to do this.
> The requirement is that you have a Hive table defined on top of your data
> (probably pointing to s3://path/to/files), with the Hive MetaStore holding
> all the relevant meta/schema information.
> If you do not have a Hive table yet, you can define one in Hive by manually
> specifying the schema, and after that partitions can be added automatically
> via Hive's 'msck repair' command.
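> Roughly, the sequence is (table and column names below are placeholders):
>
>     -- In Hive, run once:
>     --   CREATE EXTERNAL TABLE my_table (id STRING, val DOUBLE)
>     --     PARTITIONED BY (some_flag BOOLEAN)
>     --     STORED AS PARQUET
>     --     LOCATION 's3://path/to/files';
>     --   MSCK REPAIR TABLE my_table;
>
>     -- In Pig, the partition column some_flag then shows up as a normal field:
>     data = LOAD 'default.my_table'
>            USING org.apache.hive.hcatalog.pig.HCatLoader();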
>
> Hope this helps,
> Adam
>
>
> On Mon, 27 Aug 2018 at 19:18, Michael Doo <[email protected]> wrote:
>
> > Hello,
> >
> > I’m trying to read partitioned Parquet data into Pig (so it’s stored in
> > S3 like
> > s3://path/to/files/some_flag=true/part-00095-a2a6230b-9750-48e4-9cd0-b553ffc220de.c000.gz.parquet).
> > I’d like to load it into Pig and add the partitions as columns. I’ve read
> > some resources suggesting using the HCatLoader, but so far haven’t had
> > success.
> >
> > Any advice would be welcome.
> >
> > ~ Michael
> >