On the subject of TOP: the reason to use it instead of a nested order + limit is that it is much more efficient for large bags. It is algebraic, so Pig can compute partial results in the combiner and merge them in the reducer rather than shipping the whole bag to one place. It also does not require a full sort of the bag.
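To make the "algebraic, no full sort" point concrete, here is a rough sketch in Python. It is not Pig's actual implementation; the function names `top_n_partial` and `top_n` are made up for illustration. Each chunk of the bag is reduced to its n best rows with a bounded heap, and those partial results are merged the same way:

```python
import heapq

def top_n_partial(n, rows, key_index):
    """Return the n highest-scoring rows of one chunk using a bounded min-heap.

    Hypothetical helper for illustration; no full sort of `rows` is performed.
    """
    heap = []  # entries are (score, seq, row); seq breaks ties so rows are never compared
    for seq, row in enumerate(rows):
        item = (row[key_index], seq, row)
        if len(heap) < n:
            heapq.heappush(heap, item)
        elif item[0] > heap[0][0]:
            # New row beats the current smallest of the kept rows: swap it in.
            heapq.heapreplace(heap, item)
    return [row for _, _, row in heap]

def top_n(n, chunks, key_index):
    """Merge per-chunk results: the top n of the whole bag is the top n of the partial top-n lists."""
    partials = []
    for chunk in chunks:  # in a MapReduce job this step could run in the combiner
        partials.extend(top_n_partial(n, chunk, key_index))
    final = top_n_partial(n, partials, key_index)
    # Only the n survivors are sorted, purely for readable output.
    return sorted(final, key=lambda r: r[key_index], reverse=True)

# Two chunks of (name, score) rows standing in for pieces of one large bag.
chunks = [[("a", 3), ("b", 9)], [("c", 7), ("d", 1)]]
best = top_n(1, chunks, 1)
```

Memory use per chunk stays at n rows, which is why this approach scales better than sorting the whole bag and then taking a limit.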
-D

On Thu, Jul 21, 2011 at 9:41 PM, Daniel Dai <da...@hortonworks.com> wrote:

> The syntax looks legal. Can you do an explain?
>
> Daniel
>
> On Thu, Jul 21, 2011 at 5:15 AM, Andrew Clegg <andrew.clegg+mah...@gmail.com> wrote:
>
> > Hi,
> >
> > I have some code that looks like this:
> >
> > top_hits = foreach regrouped {
> >     result = TOP(1, 6, projected_joined_albums); -- field 6 = score
> >     generate flatten(result);
> > };
> >
> > I'm not too keen on the TOP syntax because it's opaque and you need
> > the comment there to explain what's going on.
> >
> > I've seen the same thing achieved like so, in a more transparent way,
> > and in fact I've used this in other cases myself:
> >
> > top_hits = foreach regrouped {
> >     sorted = order projected_joined_albums by score desc;
> >     result = limit sorted 1;
> >     generate flatten(result);
> > };
> >
> > However, although the first form works for me, the second dies with
> > the following error:
> >
> > java.lang.ClassCastException: java.lang.Integer cannot be cast to org.apache.pig.data.Tuple
> >     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:392)
> >     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:276)
> >     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:138)
> >     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:291)
> >     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:355)
> >     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:291)
> >     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.runPipeline(PigMapReduce.java:433)
> >     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.processOnePackageOutput(PigMapReduce.java:401)
> >     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:381)
> >     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:251)
> >     at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
> >     (etc.)
> >
> > Is there a reason for why it would fail in this case? I can't
> > understand the meaning of the error, it'd be nice if it reported
> > *which* Tuple was failing a cast.
> >
> > regrouped has the following schema:
> >
> > {group: (artistid: int,country: int,week:
> > chararray),projected_joined_albums:
> > {joined_albums_2::joined_albums_1::flattened_albums::key: (artistid:
> > int,country: int,week:
> > chararray),joined_albums_2::joined_albums_1::flattened_albums::timestamp:
> > long,joined_albums_2::joined_albums_1::flattened_albums::albumid:
> > int,track_counts::numtracks: long,joined_albums_2::reach::reach:
> > int,joined_albums_2::joined_albums_1::album_titles::title_len:
> > long,score: long}}
> >
> > That's a bit complex so I extracted the individual fields with a
> > foreach .. generate beforehand:
> >
> > {group: (artistid: int,country: int,week:
> > chararray),projected_joined_albums: {key: (artistid: int,country:
> > int,week: chararray),timestamp: long,albumid: int,numtracks:
> > long,reach: int,title_len: long,score: long}}
> >
> > It didn't affect the error, though.
> >
> > Thanks for any suggestions,
> >
> > Andrew.
> >
> > --
> >
> > http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg