You can append the file name to each input record. See
https://cwiki.apache.org/confluence/display/PIG/FAQ#FAQ-Q%3AIloaddatafromadirectorywhichcontainsdifferentfile.HowdoIfindoutwherethedatacomesfrom%3F
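
For selecting the files in the first place, a glob in the LOAD path may
already be enough, since Pig passes the path pattern to Hadoop. The
following is only a sketch under assumptions that are not in the original
mail: the files live under a hypothetical /data/events directory, the
piggybank AvroStorage loader is used, and the jar paths are placeholders
(your Pig version may need additional jars). How AvroStorage handles files
whose schemas differ between versions depends on the Pig release and is not
covered here.

```pig
-- Hypothetical jar locations; adjust to your installation.
REGISTER /path/to/piggybank.jar;
REGISTER /path/to/avro.jar;
REGISTER /path/to/json-simple.jar;

-- Every version of type A. New versions (v3, v4, ...) are picked up
-- without touching the script.
a_all = LOAD '/data/events/type-A-v*.avro'
        USING org.apache.pig.piggybank.storage.avro.AvroStorage();

-- Only versions whose number starts with 1 or 2. Note that this
-- character class would also match v10, v25, and so on.
a_v1_v2 = LOAD '/data/events/type-A-v[12]*.avro'
          USING org.apache.pig.piggybank.storage.avro.AvroStorage();

-- Inspect the schema Pig derived for the combined input.
DESCRIBE a_all;
```

Because both versions arrive through a single LOAD, no explicit UNION is
needed in this sketch.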

Daniel

On Mon, Jan 2, 2012 at 6:27 AM, Meyer, Dennis <[email protected]> wrote:

> Hi,
>
> We have a use-case where it would be beneficial to "select" multiple files
> to process by a regex pattern (or a loop-like functionality to dynamically
> adjust which files to pick). We have files of different types and inside
> one type they have versions where we add new data to the records, but we do
> not remove info. As the files of the same type would be very similar, this
> would be a UNION. The files are stored in a directory and look like:
>
> type-A-v1—1.avro
> type-A-v1—2.avro
> type-A-v1—3.avro
> type-A-v1—4.avro
> type-A-v2—1.avro
> type-A-v2—2.avro
> type-A-v2—3.avro
> type-A-v2—4.avro
> type-A-v2—5.avro
> type-B-v1—1.avro
> type-B-v1—2.avro
> type-B-v1—3.avro
> ….
> Same with C etc…
>
> As you can guess, the v1 stands for version #1, so higher versions will have
> new fields in them. Different types contain different data.
>
> It would be great if there were a way to address only certain files
> (aggregate all files of type "A" for "v1" and "v2"). What would be the
> technique of choice here?
> The aim is to increment the version (adding fields to the records
> dynamically) without changing the aggregation itself. Of course the new
> fields will just be ignored.
>
> Thanks,
> Dennis
>
