You can append the filename to each input record and then filter on it. See https://cwiki.apache.org/confluence/display/PIG/FAQ#FAQ-Q%3AIloaddatafromadirectorywhichcontainsdifferentfile.HowdoIfindoutwherethedatacomesfrom%3F
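
Once each record carries its source filename, picking out "type A, v1 and v2" is a pattern match on that name. Below is a rough sketch of that matching step in plain Python, not Pig. The helper name select_files, the regex, and the sample names (written with a plain hyphen before the part number) are assumptions for illustration only.

```python
import re


def select_files(names, file_type, versions):
    """Return names shaped like type-<file_type>-v<N>...avro with N in versions."""
    # Capture the version number; \D stops "v1" from also matching "v10".
    pattern = re.compile(
        r"^type-" + re.escape(file_type) + r"-v(\d+)\D.*\.avro$"
    )
    selected = []
    for name in names:
        match = pattern.match(name)
        if match and int(match.group(1)) in versions:
            selected.append(name)
    return selected


# Hypothetical sample names, simplified from the listing in the question.
sample = [
    "type-A-v1-1.avro",
    "type-A-v2-5.avro",
    "type-A-v3-1.avro",
    "type-B-v1-1.avro",
]
print(select_files(sample, "A", {1, 2}))
```

Adding a new version then only means changing the set of versions passed in; the matching logic itself stays the same.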
Daniel

On Mon, Jan 2, 2012 at 6:27 AM, Meyer, Dennis <[email protected]> wrote:
> Hi,
>
> We have a use-case where it would be beneficial to "select" multiple files
> to process by a regex pattern (or a loop-like functionality to dynamically
> adjust which files to pick). We have files of different types and inside
> one type they have versions where we add new data to the records, but we do
> not remove info. As the files of the same type would be very similar, this
> would be a UNION. The files are stored in a directory and look like:
>
> type-A-v1—1.avro
> type-A-v1—2.avro
> type-A-v1—3.avro
> type-A-v1—4.avro
> type-A-v2—1.avro
> type-A-v2—2.avro
> type-A-v2—3.avro
> type-A-v2—4.avro
> type-A-v2—5.avro
> type-B-v1—1.avro
> type-B-v1—2.avro
> type-B-v1—3.avro
> ….
> Same with C etc…
>
> As you can guess the v1 stands for version #1, so higher version will have
> new fields in it. Different types contain different data.
>
> It would be great if there is a possibility to address only certain files
> (aggregate all files type "A" for "v1" and "v2"). What would be the
> technique of choice here?
> The aim is to increment the version (adding fields to the records
> dynamically) without changing the aggregation itself. Of course the new
> fields will just be ignored.
>
> Thanks,
> Dennis
>
