Hi Stan,

A foreach is inserted only if the "load" statement has an "as" clause. This is to ensure that the loaded data conforms to the "as" clause. At some point there was a bug in the implementation; it should be fixed by PIG-2346, and the fix will be included in all subsequent releases.

Thanks,
Daniel

On Fri, Dec 30, 2011 at 9:54 AM, Stan Rosenberg <[email protected]> wrote:
> Howdy All,
>
> I am resurrecting my previous message sent to the list on Dec. 7. Let
> me first summarize. In a nutshell, as far as I can tell,
> partition-aware loading is broken in pig, and the culprit is PIG-1188
> wherein the final decision was to introduce project & cast, i.e,
> foreach, after load. There are two problems with that approach.
> First, as indicated in my original message, 'getPartitionKeys' is
> never invoked because instead of the expected instruction sequence
> 'load; filter', PIG-1188 changed it to 'load; foreach; filter'.
> Second, if a loader already happens to project & cast in order to
> adhere the data to the schema, then the foreach synthesized by pig
> is a waste of time.
>
> Essentially, we had to undo the patch in 'PIG-1188' in order to get
> partition filters to work; this enabled us to implement a HiveLoader
> very much like HCatLoader which incidentally is also broken for the
> very same reason.
> This is obviously a hack and a real solution is needed.
> If the decision made in PIG-1188 cannot be re-considered, then I
> suggest that we revisit the logic which is used to pass partition
> filters to partition-aware loaders.
>
> Many thanks!
>
> stan
>
> ---------- Forwarded message ----------
> From: Stan Rosenberg <[email protected]>
> Date: Wed, Dec 7, 2011 at 12:24 PM
> Subject: Partition keys in LoadMetadata is broken in 0.10?
> To: [email protected]
>
> Hi,
>
> I am trying to implement a loader which is partition-aware. As
> prescribed, my loader implements LoadMetadata, however,
> getPartitionKeys is never invoked.
> The script is of this form:
>
> X = LOAD 'input' USING MyLoader();
> X = FILTER X BY partition_col == 'some_string';
>
> and the schema returned by MyLoader.getSchema includes the column
> 'partition_col' which is of type 'chararray'.
>
> After debugging pig, I have found what appears to be a bug in the new
> code (version 0.10 snapshot and also in 0.9.1). The reason
> MyLoader.getPartitionKeys is never invoked is due to the wrongfully
> inserted 'foreach' after the 'load' and before the 'filter'. The code
> in TypeCastInserterTransformer.check used to return 'false' if the
> schemas matched or all fields were of type 'bytearray'; cf. pig
> version 0.8.1.
> Effectively, the above script gets transformed into:
>
> X = LOAD 'input' USING MyLoader();
> X = FOREACH X GENERATE ...;
> X = FILTER X BY partition_col == 'some_string';
>
> Subsequently, PartitionFilterPushDownTransformer.check observes that
> the immediate successor of 'load' is _not_ 'filter', whence
> getPartitionKeys is never invoked.
>
> Any suggestions?
>
> Thanks,
>
> stan
>
> P.S. While in the above case the 'foreach' can be avoided, in general
> typecasting may need to be performed if the user-provided schema does
> not match the one returned by the loader.
> I think the general case needs to be handled correctly, perhaps by
> ignoring all synthetic operators after the 'load'. (This is just a
> wild guess.)
>
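
For anyone following the thread, here is a toy Python model of the matching problem described above. It is only a sketch: Op, strict_match and lenient_match are hypothetical names made up for illustration and are not Pig classes or APIs. It shows why a rule that requires 'filter' to be the immediate successor of 'load' stops firing once a foreach is inserted, and what "ignoring synthetic operators after the load" (the guess in the P.S.) might look like.

```python
from typing import List, NamedTuple


class Op(NamedTuple):
    """Hypothetical stand-in for a logical-plan operator (not a Pig class)."""
    name: str
    synthetic: bool = False  # True if inserted by the optimizer, not the user


def strict_match(plan: List[Op]) -> bool:
    """Fire only if the operator immediately after 'load' is 'filter'."""
    for i, op in enumerate(plan[:-1]):
        if op.name == "load":
            return plan[i + 1].name == "filter"
    return False


def lenient_match(plan: List[Op]) -> bool:
    """Like strict_match, but skip synthetic operators that follow 'load'."""
    for i, op in enumerate(plan):
        if op.name == "load":
            rest = plan[i + 1:]
            while rest and rest[0].synthetic:
                rest = rest[1:]
            return bool(rest) and rest[0].name == "filter"
    return False


# The script as the user wrote it: load; filter
user_plan = [Op("load"), Op("filter")]
# The script after the foreach is inserted: load; foreach; filter
rewritten_plan = [Op("load"), Op("foreach", synthetic=True), Op("filter")]

print(strict_match(user_plan), strict_match(rewritten_plan))
print(lenient_match(rewritten_plan))
```

The lenient variant skips only operators flagged as synthetic, so a foreach written by the user still blocks the match. Whether skipping the inserted foreach is actually safe in Pig depends on what the cast does to the partition column, which this sketch does not model.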
