Re: Confused by FOREACH .. GENERATE .. TOP semantics

Dmitriy Ryaboy Fri, 22 Jul 2011 05:56:55 -0700

On the subject of TOP -- the reason you would use it instead of an inner
order + limit is that it's much more efficient for large bags.
It is algebraic, so Pig can compute partial results early (in the combiner,
where one applies) instead of handling the whole bag in one place. On top of
that, it does not require a full sort of the bag.
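
To make the "algebraic" and "no full sort" points concrete, here is a rough
sketch in Python (not Pig, and not Pig's actual implementation). The function
names and the sample (albumid, score) rows are made up for illustration. The
idea is that each chunk of the bag is reduced to its own top k first, and only
those small partial results are merged; a bounded heap stands in for the full
sort.

```python
import heapq

def top_k_partial(rows, k, field):
    """Per-chunk step: keep only the k rows with the largest rows[field]."""
    return heapq.nlargest(k, rows, key=lambda r: r[field])

def top_k_combine(partials, k, field):
    """Final step: merge the small partial results and reduce to k rows again."""
    merged = [row for part in partials for row in part]
    return heapq.nlargest(k, merged, key=lambda r: r[field])

# Hypothetical bag of (albumid, score) tuples, split into two chunks.
chunk_a = [(101, 7), (102, 42), (103, 5)]
chunk_b = [(104, 19), (105, 3)]

# Algebraic-style evaluation: top 1 per chunk, then top 1 of the partials.
partials = [top_k_partial(chunk, 1, 1) for chunk in (chunk_a, chunk_b)]
result = top_k_combine(partials, 1, 1)

# The order + limit equivalent: sort everything, then take the first row.
full_sort = sorted(chunk_a + chunk_b, key=lambda r: r[1], reverse=True)[:1]
```

Both approaches pick the same row here, but the first never sorts or even
holds the whole bag at once, which is where the saving on large bags would
come from.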


-D

On Thu, Jul 21, 2011 at 9:41 PM, Daniel Dai <da...@hortonworks.com> wrote:

> The syntax looks legal. Can you do an explain?
>
> Daniel
>
> On Thu, Jul 21, 2011 at 5:15 AM, Andrew Clegg <andrew.clegg+mah...@gmail.com> wrote:
>
> > Hi,
> >
> > I have some code that looks like this:
> >
> > top_hits = foreach regrouped {
> >    result = TOP(1, 6, projected_joined_albums); -- field 6 = score
> >    generate flatten(result);
> > };
> >
> > I'm not too keen on the TOP syntax because it's opaque and you need
> > the comment there to explain what's going on.
> >
> > I've seen the same thing achieved like so, in a more transparent way,
> > and in fact I've used this in other cases myself:
> >
> > top_hits = foreach regrouped {
> >    sorted = order projected_joined_albums by score desc;
> >    result = limit sorted 1;
> >    generate flatten(result);
> > };
> >
> > However, although the first form works for me, the second dies with
> > the following error:
> >
> > java.lang.ClassCastException: java.lang.Integer cannot be cast to org.apache.pig.data.Tuple
> >        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:392)
> >        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:276)
> >        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:138)
> >        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:291)
> >        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:355)
> >        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:291)
> >        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.runPipeline(PigMapReduce.java:433)
> >        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.processOnePackageOutput(PigMapReduce.java:401)
> >        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:381)
> >        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:251)
> >        at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
> > (etc.)
> >
> > Is there a reason why it would fail in this case? I can't
> > understand the meaning of the error; it'd be nice if it reported
> > *which* Tuple was failing a cast.
> >
> > regrouped has the following schema:
> >
> > {group: (artistid: int,country: int,week:
> > chararray),projected_joined_albums:
> > {joined_albums_2::joined_albums_1::flattened_albums::key: (artistid:
> > int,country: int,week:
> > chararray),joined_albums_2::joined_albums_1::flattened_albums::timestamp:
> > long,joined_albums_2::joined_albums_1::flattened_albums::albumid:
> > int,track_counts::numtracks: long,joined_albums_2::reach::reach:
> > int,joined_albums_2::joined_albums_1::album_titles::title_len:
> > long,score: long}}
> >
> > That's a bit complex so I extracted the individual fields with a
> > foreach .. generate beforehand:
> >
> > {group: (artistid: int,country: int,week:
> > chararray),projected_joined_albums: {key: (artistid: int,country:
> > int,week: chararray),timestamp: long,albumid: int,numtracks:
> > long,reach: int,title_len: long,score: long}}
> >
> > It didn't affect the error, though.
> >
> > Thanks for any suggestions,
> >
> > Andrew.
> >
> > --
> >
> > http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg
> >
>
