I solved my own problem. For anyone who's curious:

It turns out that subclassing an InputFormat allows one to override the
listStatus method, which returns the list of files for Hive (or MapReduce in
general) to process. All I had to do was subclass
org.apache.hadoop.mapred.TextInputFormat and override listStatus to drop the
directory entries, and voila: Hive no longer tries to read the subdirectories
as files. Here's the Java code that I used:

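package com.example.mapreduce.input;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

public class TextFileInputFormatIgnoreSubDir extends TextInputFormat {
    @Override
    protected FileStatus[] listStatus(JobConf job) throws IOException {
        // Let TextInputFormat list everything under the input paths first.
        FileStatus[] files = super.listStatus(job);

        // Keep only regular files; a directory entry is what triggers the
        // "Not a file" IOException.
        List<FileStatus> newFiles = new ArrayList<FileStatus>();
        for (FileStatus file : files) {
            // isDir() is the Hadoop 0.20.x API; newer releases use isDirectory().
            if (!file.isDir()) {
                newFiles.add(file);
            }
        }

        return newFiles.toArray(new FileStatus[newFiles.size()]);
    }
}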
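
To try the filtering idea without a Hadoop cluster, here is a rough
stand-alone sketch of the same logic against the local filesystem. It uses
java.nio.file (Java 7+) instead of the Hadoop FileStatus API, and the class
name IgnoreSubDirDemo and the method listFilesOnly are made up for
illustration; they are not part of Hadoop or Hive.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class IgnoreSubDirDemo {
    // Hypothetical helper: returns the entries directly under dir that are
    // not directories, mirroring the !file.isDir() check in listStatus.
    public static List<Path> listFilesOnly(Path dir) throws IOException {
        List<Path> files = new ArrayList<Path>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path entry : entries) {
                if (!Files.isDirectory(entry)) {
                    files.add(entry);
                }
            }
        }
        return files;
    }

    public static void main(String[] args) throws IOException {
        // Build a fake "partition" directory: one data file, one unrelated subdir.
        Path partition = Files.createTempDirectory("partition");
        Files.createFile(partition.resolve("part-00000"));
        Files.createDirectory(partition.resolve("unrelated_subdir"));

        // Only the data file should be listed.
        System.out.println(listFilesOnly(partition));
    }
}
```

Like the listStatus override, this only drops the directory entries; it does
not descend into the subdirectories to pick up files inside them.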

And the HiveQL code I used to define the table:

CREATE EXTERNAL TABLE users (id STRING, user_name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS INPUTFORMAT
'com.example.mapreduce.input.TextFileInputFormatIgnoreSubDir'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/data/test/users';

Hope this saves someone else the trouble of figuring it out...

-Dave

On Thu, Aug 18, 2011 at 3:53 PM, Dave <drive...@gmail.com> wrote:

> Hi,
>
> I have a partitioned external table in Hive, and in the partition
> directories there are other subdirectories that are not related to the table
> itself. Hive seems to want to scan those directories, as I am getting an
> error message when trying to do a SELECT on the table:
>
> Failed with exception java.io.IOException:java.io.IOException: Not a file:
> hdfs://path/to/partition/path/to/subdir
>
> Also, it seems to ignore directories prefixed by an underscore
> (_directory).
>
> I am using hive 0.7.1 on Hadoop 0.20.2.
>
> Is there a way to force Hive to ignore all subdirectories in external
> tables and only look at files?
>
> Thanks in advance,
> -Dave
>
