I am writing a loader for a storage format that partitions its data by a particular field in the record. I would like to push filters on the partition field down to the loader, so that the record reader does not need to read files that fall outside the filtered range. In the "LoadMetadata" interface, the "getPartitionKeys" and "setPartitionFilter" functions seem to support what I need (as I understand it, Pig should pass the filter expression on the declared partition keys to "setPartitionFilter"), but I have a couple of questions. I'm going to reference the following example, where timestamp is the partition key:

    a = load 'stored_data' using CustomLoader();
    b = filter a by timestamp == CUSTOM_UDF(date, month);

1. Would partition filter pushdown work in this case, where the filter on the partition key includes a UDF?
2. Does the filter statement on the partition key need to come directly after the load statement? What I mean is, if I declare a relation c between a and b that does some other operation on a, will Pig still pass the filter expression of b when loading a?
3. Can you point out roughly where "setPartitionFilter" is called in the Pig code during the load process? I couldn't find it by searching the Pig source.

Thanks a lot!
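
P.S. To make the goal concrete, here is a minimal sketch of the file pruning I have in mind once the loader knows the filtered range. It does not use any Pig API: the class PartitionPruner, the method prune, and the file names are all hypothetical, and it assumes the filter has already been reduced to an inclusive [min, max] range on timestamp.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.TreeMap;

// Hypothetical helper, not part of Pig. It assumes the storage format can
// list its files keyed by the partition value (timestamp) they contain.
class PartitionPruner {

    // Returns the files whose partition key lies in [min, max], in ascending
    // key order. An empty range (min > max) selects no files.
    static List<String> prune(TreeMap<Long, String> filesByTimestamp, long min, long max) {
        if (min > max) {
            return Collections.emptyList();
        }
        // subMap with inclusive bounds keeps only the partitions in range,
        // so the record reader never has to open the other files.
        return new ArrayList<>(filesByTimestamp.subMap(min, true, max, true).values());
    }

    public static void main(String[] args) {
        TreeMap<Long, String> files = new TreeMap<>();
        files.put(100L, "part-100");
        files.put(200L, "part-200");
        files.put(300L, "part-300");
        files.put(400L, "part-400");

        // Only the partitions with 150 <= timestamp <= 300 are read.
        System.out.println(prune(files, 150L, 300L)); // prints [part-200, part-300]
    }
}
```

The open question is how to get from the expression Pig hands to "setPartitionFilter" to a range like this, especially when the right-hand side of the filter is a UDF call.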
