Hi Prav,

thanks for your answer. I already have an Avro schema for the input data (dirtydata in the example), and the sample works just fine. Essentially, I load the data, eliminate some rows based on a condition, and then want to store the remaining rows as clean data. The only issue is that the join used for the elimination adds the join columns of the other relation to the result. So the task I am facing is to remove those columns after the rows have been eliminated. The schema is rather complex, so naming all the columns is not an option, and I want to store the clean data with the same, original schema.
Does anybody know whether this is possible, or whether there is a simpler way of doing it?

Many thanks
Jakub

On 14 October 2014 13:46, praveenesh kumar <praveen...@gmail.com> wrote:

> Not sure if it's the best way to do it, but what you can do is run
> "describe dirtydata" to see what schema Pig defines for your Avro data.
> If you already have an Avro schema stored somewhere in a .avsc file, use
> that; otherwise you can use the avro command line tool to generate the
> schema in a .avsc file first.
>
> Once you have the schema, you can pass the schema file using AvroStorage():
>
> dirtydata = LOAD '/data/0120422' USING
>     AvroStorage('no_schema_check','schema_file', 'hdfs path of your avsc avro schema file');
> describe dirtydata;
>
> You should be able to see the schema/columns of your relation. Once you
> have the schema for your Pig relation, you can refer to the columns of a
> relation used in a join statement with the :: operator.
>
> So let's say your dirtydata has 2 columns (name, salary); you can refer to
> them (after the join) using dirtydata::name, dirtydata::salary
>
> It's best to run the describe statement on any relation if you are unsure
> how to refer to or project fields from it. Hope that helps.
>
> Regards
> Prav
>
> On Tue, Oct 14, 2014 at 12:02 PM, Jakub Stransky <stransky...@gmail.com>
> wrote:
>
> > Hello experienced users,
> >
> > I am new to Pig and I have what is probably a beginner's question: Is it
> > possible to get the original fields of a relation after a join?
> >
> > Suppose I have a relation A which I want to filter by data from relation B.
> > In order to find matching records I join the relations and then perform a
> > filter. Then I would like to get just the fields from relation A.
> >
> > Practical example:
> > dirtydata = load '/data/0120422' using AvroStorage();
> >
> > sodtr = filter dirtydata by TransactionBlockNumber == 1;
> > sto = foreach sodtr generate Dob.Value as Dob, StoreId, Created.UnixUtcTime;
> > g = GROUP sto BY (Dob,StoreId);
> > sodtime = FOREACH g GENERATE group.Dob AS Dob, group.StoreId as StoreId,
> >     MAX(sto.UnixUtcTime) AS latestStartOfDayTime;
> >
> > joined = join dirtydata by (Dob.Value, StoreId) LEFT OUTER, sodtime by
> >     (Dob, StoreId);
> >
> > cleandata = filter joined by dirtydata::Created.UnixUtcTime >=
> >     sodtime.latestStartOfDayTime;
> > finaldata = FOREACH cleandata generate dirtydata:: ;
> > -- <-- HERE I would like to get just the columns which belonged to the
> > -- original relation. The Avro schema is rather complicated, so it is not
> > -- feasible to name all columns here.
> >
> > What is the best practice in that case? Is there any function? Or is there
> > a completely different approach to solving this kind of task?
> >
> > Thanks a lot for any help
> > Jakub

--
Jakub Stransky
cz.linkedin.com/in/jakubstransky
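
One possible way to keep only the dirtydata columns without naming each of them is Pig's project-range expression (`..`, available since Pig 0.9), which projects every column between two endpoints. The sketch below reuses the script from the thread. FirstField and LastField are hypothetical placeholders for the first and last top-level fields that `describe dirtydata` reports, since the real schema is not shown here. The filter uses `sodtime::latestStartOfDayTime` on the assumption that the joined column is what is meant; in Pig, `relation.field` is a scalar projection of the whole relation rather than a reference to the joined column.

```pig
-- Sketch only, not tested against the real data set.
-- FirstField / LastField are hypothetical names: substitute the first and
-- last top-level fields shown by "describe dirtydata".
dirtydata = LOAD '/data/0120422' USING AvroStorage();

sodtr   = FILTER dirtydata BY TransactionBlockNumber == 1;
sto     = FOREACH sodtr GENERATE Dob.Value AS Dob, StoreId,
                                 Created.UnixUtcTime AS UnixUtcTime;
g       = GROUP sto BY (Dob, StoreId);
sodtime = FOREACH g GENERATE group.Dob AS Dob, group.StoreId AS StoreId,
                             MAX(sto.UnixUtcTime) AS latestStartOfDayTime;

joined    = JOIN dirtydata BY (Dob.Value, StoreId) LEFT OUTER,
                 sodtime BY (Dob, StoreId);

-- Assumption: the joined column is intended, hence "::" instead of ".".
cleandata = FILTER joined BY
            dirtydata::Created.UnixUtcTime >= sodtime::latestStartOfDayTime;

-- Project-range: everything from the first to the last dirtydata column,
-- which leaves out the three columns contributed by sodtime.
finaldata = FOREACH cleandata GENERATE
            dirtydata::FirstField .. dirtydata::LastField;

DESCRIBE finaldata;
```

Because the columns of the left relation come first in the join output, a positional range such as `$0 .. $N` (with N being the number of dirtydata fields minus one) should select the same columns. The projected fields may still show the `dirtydata::` prefix in the DESCRIBE output, so whether AvroStorage accepts them on STORE with the original schema is worth checking before relying on this.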