JIRA filed, see: https://issues.apache.org/jira/browse/PIG-2014
--jacob
@thedatachef

On Mon, 2011-04-25 at 09:02 -0700, Alan Gates wrote:
> You are not insane. Pig rewrites sample into filter, and then pushes
> that filter in front of the group. It shouldn't push that filter
> since the UDF is non-deterministic. If you add "-t PushUpFilter" to
> your command line when invoking pig this won't happen. Could you file
> a JIRA for this so we keep track of it?
>
> Alan.
>
> On Apr 24, 2011, at 10:41 AM, Jacob Perkins wrote:
>
> > So I'm running into something strange. Consider the following code:
> >
> > tfidf_all = LOAD '$TFIDF' AS (doc_id:chararray, token:chararray,
> >     weight:double);
> > grouped = GROUP tfidf_all BY doc_id;
> > vectors = FOREACH grouped GENERATE group AS doc_id,
> >     tfidf_all.(token, weight) AS vector;
> > DUMP vectors;
> >
> > This, of course, runs just fine. tfidf_all contains 1,428,280 records.
> > The reduce output records should be exactly the number of documents,
> > which turn out to be 18,863 in this case. All well and good.
> >
> > The strangeness comes when I add a SAMPLE command:
> >
> > sampled = SAMPLE vectors 0.0012;
> > DUMP sampled;
> >
> > Running this results in 1,513 reduce output records. So, am I insane
> > or shouldn't the reduce output records be much, much closer to 22 or
> > 23 records (e.g. 0.0012 * 18,863)?
> >
> > --jacob
> > @thedatachef
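
Alan's explanation accounts for the numbers in the thread: pushing the sample's filter in front of the GROUP means the 0.0012 sample is applied to the 1,428,280 input rows (~1,714 rows survive) rather than to the 18,863 groups (~23 would survive), and the surviving rows collapse into roughly 1,500 distinct doc_ids. A rough sketch of that arithmetic, as a hypothetical Python simulation with synthetic data (not Pig itself; the counts and rate are taken from the thread, everything else is made up):

```python
import random

# Counts from the thread; the doc_id assignments below are synthetic.
N_ROWS = 1_428_280   # records in tfidf_all
N_DOCS = 18_863      # distinct doc_ids (groups)
P = 0.0012           # SAMPLE rate

random.seed(42)

# Each row belongs to a random document (~76 rows per doc on average).
rows = [random.randrange(N_DOCS) for _ in range(N_ROWS)]

# Intended plan: GROUP first, then SAMPLE the grouped relation.
groups = set(rows)
sampled_groups = [g for g in groups if random.random() < P]

# Buggy plan (filter pushed above the group): SAMPLE the raw rows,
# then group whatever rows survive.
sampled_rows = [r for r in rows if random.random() < P]
groups_from_sampled_rows = set(sampled_rows)

print(len(sampled_groups))            # on the order of 23 (0.0012 * 18,863)
print(len(groups_from_sampled_rows))  # on the order of 1,500, like the 1,513 observed
```

The second count lands near 1,500 rather than exactly 0.0012 * 1,428,280 ≈ 1,714 because some of the sampled rows share a doc_id and collapse into the same group.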
