Hi,
I need to process a few hundred thousand files (1-2 GB each) scattered across thousands of different directories. I'd like to partition/group them based on my own custom logic so that I can benefit from partition pruning. Each partition would contain a few hundred files from hundreds of different directories. Is this supported?

According to the Hive Language Manual (DDL), a partition can point to only one location. If I add one partition for each file I plan to process, I'd end up with a few hundred or even thousands of partitions. I suspect this might result in hundreds to thousands of MR tasks in Hadoop.

I noticed that a feature was added to support pointing an external table to multiple locations listed in a symlink file (for TextInputFormat only):
https://issues.apache.org/jira/browse/HIVE-1272

Is there a similar feature in the works for partitions? If so, would it support other formats (Avro, Parquet, etc.)?

Thanks,
Tao
