If you know the first and the last column, you can use the Pig range
projection operator, something like "foreach <relation> generate
<first_col>..<last_col>;"
Pig will automatically include all the columns that come between those
two columns.
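For example, assuming the first and last fields of the original schema
were named FirstField and LastField (hypothetical names, substitute your
own), a sketch for the thread's scenario would be:

finaldata = FOREACH cleandata GENERATE
                dirtydata::FirstField .. dirtydata::LastField;

This should keep every column that came from the original dirtydata
relation and drop the columns the join pulled in from sodtime.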

On Tue, Oct 14, 2014 at 1:40 PM, Jakub Stransky <stransky...@gmail.com>
wrote:

> Hi Prav,
>
> thanks for your answer. I already have an Avro schema for the input data
> - dirtydata in the example. The sample works just fine. Essentially I am
> loading data, then eliminating some rows given by a condition, and then I
> would like to store the result as clean data. The only issue is that the
> join performed during the elimination adds the join columns. So the task
> I am facing is to remove those columns after the row elimination. The
> schema is rather complex, so naming all the columns is not an option, as
> I want to store the clean data with the same, original schema.
>
> Does anybody know if this is possible, or a simpler way of performing
> this activity?
>
> Many thanks
> Jakub
>
> On 14 October 2014 13:46, praveenesh kumar <praveen...@gmail.com> wrote:
>
> > Not sure if it's the best way to do it, but what you can do is run
> > "describe dirtydata" to see what schema Pig defines for your Avro data.
> > If you already have an Avro schema stored somewhere in a .avsc file, use
> > that; otherwise you can use the Avro command line tool to generate the
> > schema into a .avsc file first.
> >
> > Once you have the schema, you can pass the schema file using
> > AvroStorage():
> >
> > dirtydata = LOAD '/data/0120422' USING
> >     AvroStorage('no_schema_check', 'schema_file',
> >                 'hdfs path of your avsc avro schema file');
> > describe dirtydata;
> >
> > You should be able to see the schema/columns of your relation. Once you
> > have the schema for your Pig relation, you can refer to the columns of
> > the relations used in the join statement with the :: (disambiguation)
> > operator.
> >
> > So let's say your dirtydata has 2 columns (name, salary); you can refer
> > to them (after the join) as dirtydata::name and dirtydata::salary.
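> > For example (a sketch using the hypothetical name/salary columns above,
> > and a made-up second relation called "bonuses"):
> >
> > joined  = JOIN dirtydata BY name LEFT OUTER, bonuses BY name;
> > cleaned = FOREACH joined GENERATE dirtydata::name, dirtydata::salary;
> >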
> >
> > It's preferable to use the describe statement on any relation if you are
> > unsure how to refer to or project fields from it. Hope that helps.
> >
> > Regards
> > Prav
> >
> > On Tue, Oct 14, 2014 at 12:02 PM, Jakub Stransky <stransky...@gmail.com>
> > wrote:
> >
> > > Hello experienced users,
> > >
> > > I am new to Pig and I have probably a beginner's question: is it
> > > possible to get the original fields of a relation after a join?
> > >
> > > Suppose I have a relation A which I want to filter by data from
> > > relation B. In order to find matching records I join the relations
> > > and then perform a filter. Then I would like to get just the fields
> > > from relation A.
> > >
> > > Practical example:
> > >
> > > dirtydata = load '/data/0120422' using AvroStorage();
> > >
> > > sodtr = filter dirtydata by TransactionBlockNumber == 1;
> > > sto   = foreach sodtr generate Dob.Value as Dob, StoreId,
> > >         Created.UnixUtcTime;
> > > g     = GROUP sto BY (Dob, StoreId);
> > > sodtime = FOREACH g GENERATE group.Dob AS Dob, group.StoreId AS
> > >         StoreId, MAX(sto.UnixUtcTime) AS latestStartOfDayTime;
> > >
> > > joined = join dirtydata by (Dob.Value, StoreId) LEFT OUTER, sodtime
> > >         by (Dob, StoreId);
> > >
> > > cleandata = filter joined by dirtydata::Created.UnixUtcTime >=
> > >         sodtime::latestStartOfDayTime;
> > > finaldata = FOREACH cleandata generate dirtydata:: ;  -- <-- HERE I
> > > would like to get just the columns which belonged to the original
> > > relation. The Avro schema is rather complicated, so it is not
> > > feasible to name all the columns here.
> > >
> > > What is the best practice in that case? Is there any function? Or is
> > > there a completely different approach to solve this kind of task?
> > >
> > > Thanks a lot for any help
> > > Jakub
> --
> Jakub Stransky
> cz.linkedin.com/in/jakubstransky