Re: Aggregating multiple files by pattern (regex possible?)

Dmitriy Ryaboy Mon, 02 Jan 2012 11:16:49 -0800

Dennis,
Hadoop and Pig support globs, which may be sufficient for what you want.
The glob matching rules are described here:
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/fs/FileSystem.html#globStatus(org.apache.hadoop.fs.Path)
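
As a rough illustration of which files a glob would pick, here is a small
Python sketch. It is not Hadoop's matcher: it only approximates `*`, `?` and
`{a,b}` alternation using Python's fnmatch, and the file names assume a plain
hyphen before the part number, which is a guess at your naming scheme.

```python
import fnmatch
import itertools
import re


def expand_braces(pattern):
    """Expand non-nested {a,b} alternations into a list of plain glob patterns."""
    # re.split with a capturing group yields: literal, alternatives, literal, ...
    parts = re.split(r"\{([^{}]*)\}", pattern)
    choices = [p.split(",") if i % 2 else [p] for i, p in enumerate(parts)]
    return ["".join(combo) for combo in itertools.product(*choices)]


def select(names, pattern):
    """Return the names matched by a glob that may contain {a,b} alternation."""
    expanded = expand_braces(pattern)
    return [n for n in names if any(fnmatch.fnmatchcase(n, p) for p in expanded)]


# Hypothetical file names, modelled on the listing in the original question.
names = [
    "type-A-v1-1.avro", "type-A-v1-2.avro",
    "type-A-v2-1.avro", "type-A-v3-1.avro",
    "type-B-v1-1.avro",
]

# Only versions 1 and 2 of type A.
print(select(names, "type-A-v{1,2}-*.avro"))

# Every version of type A, including versions added later.
print(select(names, "type-A-v*.avro"))
```

The second pattern is the one that keeps working unchanged when a new version
shows up, which sounds like what you are after.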


If those aren't sufficient, it's possible to write a custom loader that does
more advanced regex handling in the input format, or you could
alter your file naming conventions / directory structure so that globs do
become sufficient.
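
To give an idea of the kind of selection a regex allows and a glob does not,
here is a sketch of the matching logic only. A real custom loader would be
written in Java against Pig's LoadFunc API; the Python below is a stand-in,
and the naming scheme, the function name and the "up to version N" rule are
all assumptions made for the example.

```python
import re

# Hypothetical naming scheme: type-<TYPE>-v<VERSION>-<PART>.avro
NAME_RE = re.compile(
    r"^type-(?P<type>[A-Z])-v(?P<version>\d+)-(?P<part>\d+)\.avro$"
)


def select_by_regex(names, wanted_type, max_version):
    """Keep files of one type whose numeric version is at most max_version."""
    selected = []
    for name in names:
        m = NAME_RE.match(name)
        if m is None:
            continue  # not one of our data files
        if m.group("type") == wanted_type and int(m.group("version")) <= max_version:
            selected.append(name)
    return selected


candidates = [
    "type-A-v1-1.avro", "type-A-v2-1.avro",
    "type-A-v10-1.avro", "type-B-v1-1.avro", "notes.txt",
]
print(select_by_regex(candidates, "A", 2))
```

Comparing the version as a number is the part a glob cannot express cleanly
once versions reach two digits.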

Hope this helps.
-Dmitriy

On Mon, Jan 2, 2012 at 6:27 AM, Meyer, Dennis <[email protected]> wrote:

> Hi,
>
> We have a use-case where it would be beneficial to "select" multiple files
> to process by a regex pattern (or a loop-like functionality to dynamically
> adjust which files to pick). We have files of different types and inside
> one type they have versions where we add new data to the records, but we do
> not remove info. As the files of the same type would be very similar, this
> would be a UNION. The files are stored in a directory and look like:
>
> type-A-v1—1.avro
> type-A-v1—2.avro
> type-A-v1—3.avro
> type-A-v1—4.avro
> type-A-v2—1.avro
> type-A-v2—2.avro
> type-A-v2—3.avro
> type-A-v2—4.avro
> type-A-v2—5.avro
> type-B-v1—1.avro
> type-B-v1—2.avro
> type-B-v1—3.avro
> ….
> Same with C etc…
>
> As you can guess the v1 stands for version #1, so higher versions will have
> new fields in it. Different types contain different data.
>
> It would be great if there is a possibility to address only certain files
> (aggregate all files of type "A" for "v1" and "v2"). What would be the
> technique of choice here?
> The aim is to increment the version (adding fields to the records
> dynamically) without changing the aggregation itself. Of course the new
> fields will just be ignored.
>
> Thanks,
> Dennis
>
