If you know the first and the last column, you can use the Pig range operator, something like "foreach <relation> generate <first_col>..<last_col>;". Pig will automatically pick up all the columns that come in between those two columns.
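For instance, in the script quoted below it could look something like this (I'm only guessing that Dob is the first and Created the last field of dirtydata's schema; check the actual boundary columns with describe first):

describe cleandata;   -- shows the join-prefixed field names
-- assuming Dob is the first and Created the last field of the original
-- dirtydata schema; substitute the real boundary columns from describe
finaldata = FOREACH cleandata GENERATE dirtydata::Dob .. dirtydata::Created;

Running describe on finaldata afterwards should confirm that the columns the join pulled in from sodtime are gone (the remaining field names may still carry the dirtydata:: prefix, though).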
On Tue, Oct 14, 2014 at 1:40 PM, Jakub Stransky <stransky...@gmail.com> wrote:

> Hi Prav,
>
> thanks for your answer. I already have an Avro schema for the input data
> (dirtydata in the example), and the sample works just fine. Essentially I am
> loading the data, then I eliminate some rows given by a condition, and then
> I would like to store the result as clean data. The only issue is that the
> join performed during the elimination adds the join columns. So the task I
> am facing is to remove those columns after the row elimination. The schema
> is rather complex, so naming all columns is not an option, as I want to
> store the clean data with the same, original schema.
>
> Does anybody know if this is possible, or a simpler way of performing this
> activity?
>
> Many thanks
> Jakub
>
> On 14 October 2014 13:46, praveenesh kumar <praveen...@gmail.com> wrote:
>
> > Not sure if it's the best way to do it, but what you can do is run
> > "describe dirtydata" to see what schema Pig defines for your Avro data.
> > You may already have an Avro schema stored somewhere in a .avsc file;
> > otherwise you can use the Avro command line tool to generate one first.
> >
> > Once you have the schema, you can pass the schema file using
> > AvroStorage() -
> >
> > dirtydata = LOAD '/data/0120422' USING
> > AvroStorage('no_schema_check','schema_file', 'hdfs path of your avsc avro
> > schema file');
> > describe dirtydata;
> >
> > You should be able to see the schema/columns of your relation. Once you
> > have the schema for your Pig relation, you can refer to the columns of
> > the relation used in the join statement with the :: operator.
> >
> > So let's say your dirtydata has 2 columns (name, salary); you can refer
> > to them (after the join) using dirtydata::name, dirtydata::salary.
> >
> > It is preferable to use a describe statement on any relation if you are
> > confused about how to refer to or project from a given relation. Hope
> > that helps.
> >
> > Regards
> > Prav
> >
> > On Tue, Oct 14, 2014 at 12:02 PM, Jakub Stransky <stransky...@gmail.com>
> > wrote:
> >
> > > Hello experienced users,
> > >
> > > I am new to Pig and I have probably a beginner's question: Is it
> > > possible to get the original fields of a relation back after a join?
> > >
> > > Suppose I have a relation A which I want to filter by data from
> > > relation B. In order to find matching records I join the relations and
> > > then perform a filter. Then I would like to get just the fields from
> > > relation A.
> > >
> > > Practical example:
> > > dirtydata = load '/data/0120422' using AvroStorage();
> > >
> > > sodtr = filter dirtydata by TransactionBlockNumber == 1;
> > > sto = foreach sodtr generate Dob.Value as Dob, StoreId,
> > > Created.UnixUtcTime;
> > > g = GROUP sto BY (Dob,StoreId);
> > > sodtime = FOREACH g GENERATE group.Dob AS Dob, group.StoreId as StoreId,
> > > MAX(sto.UnixUtcTime) AS latestStartOfDayTime;
> > >
> > > joined = join dirtydata by (Dob.Value, StoreId) LEFT OUTER, sodtime by
> > > (Dob, StoreId);
> > >
> > > cleandata = filter joined by dirtydata::Created.UnixUtcTime >=
> > > sodtime.latestStartOfDayTime;
> > > finaldata = FOREACH cleandata generate dirtydata:: ; -- <-- HERE I
> > > would like to get just the columns which belonged to the original
> > > relation. The Avro schema is rather complicated, so it is not feasible
> > > to name all columns here.
> > >
> > > What is the best practice in that case? Is there any function? Or is
> > > there a completely different approach to solve this kind of task?
> > >
> > > Thanks a lot for any help
> > > Jakub
> > >
> > > --
>
>
> --
> Jakub Stransky
> cz.linkedin.com/in/jakubstransky