Hi Prav,

Thanks for your answer. I already have an Avro schema for the input data
(dirtydata in the example), and the sample works just fine. Essentially, I
load the data, eliminate some rows based on a condition, and then want to
store the result as clean data. The only issue is that the join I perform
during the elimination adds the join columns to the relation. So the task I
am facing is to remove those columns after the rows are eliminated. The
schema is rather complex, so naming all the columns is not an option, and I
want to store the clean data with the same, original schema.

Does anybody know if this is possible, or know of a simpler way of
performing this activity?

Many thanks
Jakub

On 14 October 2014 13:46, praveenesh kumar <praveen...@gmail.com> wrote:

> Not sure if it's the best way to do it, but you can run "describe
> dirtydata" to see what schema Pig defines for your Avro data.
> If you already have an Avro schema stored somewhere in a .avsc file, use
> that; otherwise you can use the Avro command-line tool to generate the
> schema in a .avsc file first.
>
> Once you have the schema, you can pass the schema file to AvroStorage():
>
> dirtydata = LOAD '/data/0120422' USING
> AvroStorage('no_schema_check','schema_file', 'hdfs path of your avsc avro
> schema file');
> describe dirtydata;
>
> You should be able to see the schema/columns of your relation. Once you
> have the schema for your Pig relation, you can refer to the columns of a
> relation used in a join statement with the :: operator.
>
> So let's say your dirtydata has 2 columns (name, salary); you can refer to
> them (after the join) as dirtydata::name and dirtydata::salary.
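>
> A minimal sketch of that two-column case, assuming a hypothetical second
> relation (lookup) and made-up input paths, none of which come from the
> original script. The last statement uses Pig's project-range syntax (Pig
> 0.9 or later), which may be a way to keep one side of a join without naming
> every column; whether it behaves well with a complex Avro schema is an
> assumption worth checking with describe:
>
> ```pig
> -- hypothetical relations and paths, for illustration only
> dirtydata = LOAD 'dirty.tsv' AS (name:chararray, salary:int);
> lookup    = LOAD 'lookup.tsv' AS (name:chararray, bonus:int);
>
> joined = JOIN dirtydata BY name, lookup BY name;
> -- fields from each side now carry the alias prefix
> DESCRIBE joined;
>
> -- name the columns you want, one by one ...
> by_name = FOREACH joined GENERATE dirtydata::name, dirtydata::salary;
> -- ... or project the range from the first to the last dirtydata column
> by_range = FOREACH joined GENERATE dirtydata::name .. dirtydata::salary;
> ```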
>
> It's best to run the describe statement on any relation if you are unsure
> how to refer to or project from it. Hope that helps.
>
> Regards
> Prav
>
> On Tue, Oct 14, 2014 at 12:02 PM, Jakub Stransky <stransky...@gmail.com>
> wrote:
>
> > Hello experienced users,
> >
> > I am new to Pig and I have what is probably a beginner's question: is it
> > possible to get the original fields from a relation after a join?
> >
> > Suppose I have a relation A which I want to filter by data from relation
> > B. In order to find matching records I join the relations and then
> > perform a filter. Then I would like to get just the fields from relation A.
> >
> > Practical example:
> > dirtydata = load '/data/0120422' using AvroStorage();
> >
> > sodtr = filter dirtydata by TransactionBlockNumber == 1;
> > sto   = foreach sodtr generate Dob.Value as Dob,StoreId,
> > Created.UnixUtcTime;
> > g     = GROUP sto BY  (Dob,StoreId);
> > sodtime = FOREACH g GENERATE group.Dob AS Dob, group.StoreId as StoreId,
> > MAX(sto.UnixUtcTime) AS latestStartOfDayTime;
> >
> > joined = join dirtydata by (Dob.Value, StoreId) LEFT OUTER, sodtime by
> > (Dob, StoreId);
> >
> > cleandata = filter joined by dirtydata::Created.UnixUtcTime >=
> > sodtime.latestStartOfDayTime;
> > -- HERE I would like to get just the columns that belonged to the original
> > -- relation. The Avro schema is rather complicated, so it is not feasible
> > -- to name all the columns here.
> > finaldata = FOREACH cleandata generate dirtydata:: ;
> >
> > What is the best practice in that case? Is there any function? Or is
> > there a completely different approach to solve this kind of task?
> >
> > Thanks a lot for any help
> > Jakub



-- 
Jakub Stransky
cz.linkedin.com/in/jakubstransky
