Hi, Stan,
Foreach is inserted only if you have an "as" clause in the "load" statement.
This is to ensure that the loaded data conforms to the "as" clause. At one
point there was a bug in the implementation; it should be fixed by PIG-2346,
and the fix will be included in all subsequent releases.

Thanks,
Daniel

On Fri, Dec 30, 2011 at 9:54 AM, Stan Rosenberg <[email protected]> wrote:

> Howdy All,
>
> I am resurrecting my previous message, sent to the list on Dec. 7.  Let
> me first summarize.  In a nutshell, as far as I can tell, partition-aware
> loading is broken in pig, and the culprit is PIG-1188, wherein the final
> decision was to introduce project & cast, i.e., foreach, after load.
> There are two problems with that approach.
> First, as indicated in my original message, 'getPartitionKeys' is never
> invoked because, instead of the expected instruction sequence
> 'load; filter', PIG-1188 changed it to 'load; foreach; filter'.  Second,
> if a loader already happens to project & cast in order to make the data
> conform to the schema, then the foreach synthesized by pig is a waste
> of time.
>
> Essentially, we had to undo the patch in 'PIG-1188' in order to get
> partition filters to work; this enabled us to implement a HiveLoader
> very much like HCatLoader, which incidentally is also broken for the
> very same reason.  This is obviously a hack, and a real solution is
> needed.  If the decision made in PIG-1188 cannot be reconsidered, then I
> suggest that we revisit the logic which is used to pass partition
> filters to partition-aware loaders.
>
> Many thanks!
>
> stan
>
>
>
> ---------- Forwarded message ----------
> From: Stan Rosenberg <[email protected]>
> Date: Wed, Dec 7, 2011 at 12:24 PM
> Subject: Partition keys in LoadMetadata is broken in 0.10?
> To: [email protected]
>
>
> Hi,
>
> I am trying to implement a loader which is partition-aware.  As
> prescribed, my loader implements LoadMetadata; however,
> getPartitionKeys is never invoked.
> The script is of this form:
>
> X = LOAD 'input' USING MyLoader();
> X = FILTER X BY partition_col == 'some_string';
>
> and the schema returned by MyLoader.getSchema includes the column
> 'partition_col' which is of type 'chararray'.
>
>
> After debugging pig, I have found what appears to be a bug in the new
> code (version 0.10 snapshot and also in 0.9.1).  The reason
> MyLoader.getPartitionKeys is never invoked is the wrongfully inserted
> 'foreach' after the 'load' and before the 'filter'.  The code in
> TypeCastInserterTransformer.check used to return 'false' if the
> schemas matched or all fields were of type 'bytearray'; cf. pig
> version 0.8.1.
> Effectively, the above script gets transformed into:
>
> X = LOAD 'input' USING MyLoader();
> X = FOREACH X GENERATE ...;
> X = FILTER X BY partition_col == 'some_string';
>
> Subsequently, PartitionFilterPushDownTransformer.check observes that
> the immediate successor of 'load' is _not_ 'filter', whence
> getPartitionKeys is never invoked.
>
> Any suggestions?
>
> Thanks,
>
> stan
>
> P.S. While in the above case the 'foreach' can be avoided, in general
> typecasting may need to be performed if the user-provided schema does
> not match the one returned by the loader.
> I think the general case needs to be handled correctly, perhaps by
> ignoring all synthetic operators after the 'load'.  (This is just a
> wild guess.)
>
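
The behaviour described in the quoted message can be illustrated with a small
toy model. This is only a sketch in Python, not Pig's code: the Op class, the
"synthetic" flag and both check functions are hypothetical names invented for
the illustration. It models a plan as a flat list of operators and compares a
check that requires 'filter' to be the immediate successor of 'load' with one
that skips synthesized operators, as the P.S. tentatively suggests.

```python
from dataclasses import dataclass


@dataclass
class Op:
    """One operator in a toy, linear plan (hypothetical; not a Pig class)."""
    kind: str                # "load", "foreach" or "filter"
    synthetic: bool = False  # True if inserted by the optimizer, not the user


def strict_check(plan):
    """Match only when 'filter' is the immediate successor of 'load'."""
    return (len(plan) >= 2
            and plan[0].kind == "load"
            and plan[1].kind == "filter")


def lenient_check(plan):
    """Match when the first non-synthetic operator after 'load' is 'filter'."""
    if not plan or plan[0].kind != "load":
        return False
    for op in plan[1:]:
        if op.synthetic:
            continue  # ignore operators the optimizer added on its own
        return op.kind == "filter"
    return False


# The script as written by the user: load; filter
user_plan = [Op("load"), Op("filter")]

# The script after a cast 'foreach' is synthesized: load; foreach; filter
rewritten_plan = [Op("load"), Op("foreach", synthetic=True), Op("filter")]

print(strict_check(user_plan), strict_check(rewritten_plan))
print(lenient_check(user_plan), lenient_check(rewritten_plan))
```

In this model the strict check stops matching once the synthesized 'foreach'
is present, while the lenient check still matches. A real fix would presumably
also have to make sure the filter condition still refers to columns the loader
understands once the cast is skipped; the toy model does not cover that.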
