Hi,

I need to process a few hundred thousand files (1-2 GB each) scattered
across thousands of different directories.

I'd like to partition/group them based on my own custom logic so I can
benefit from partition pruning. Each partition would contain a few hundred
files from hundreds of different directories.

Is this supported? According to the Hive Language Manual (DDL), a partition
can point to only one location. If I add one partition for each file I plan
to process, I'd end up with a few hundred or even thousands of partitions.
I suspect this might result in hundreds to thousands of MR tasks in
Hadoop.
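
To make this concrete, here is a rough sketch of the one-location-per-partition
approach as I understand it from the DDL manual. The table name, column,
partition key, and paths are all made up for illustration:

```sql
-- Hypothetical table; names and paths are placeholders, not my real layout.
CREATE EXTERNAL TABLE events (line STRING)
PARTITIONED BY (grp STRING)
STORED AS TEXTFILE;

-- Each partition takes exactly one LOCATION, so files that live in
-- several directories cannot be grouped into one partition this way.
ALTER TABLE events ADD PARTITION (grp = 'a') LOCATION '/data/dir1';
ALTER TABLE events ADD PARTITION (grp = 'b') LOCATION '/data/dir2';
```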

I noticed that a feature was added to support pointing an external table at
multiple locations listed in a symlink file:
https://issues.apache.org/jira/browse/HIVE-1272 (for TextInputFormat only)
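
From what I can tell, using that feature looks roughly like the sketch below.
The table name and paths are hypothetical, and I have not verified this
against my Hive version:

```sql
-- Hypothetical table; the LOCATION directory is assumed to hold small text
-- files, each listing one path to a real data file per line.
CREATE EXTERNAL TABLE events_symlink (line STRING)
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/data/symlinks/events';
```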

Is a similar feature in the works for partitions? If so, would it support
other formats (Avro, Parquet, etc.)?


Thanks

Tao
